Commit Graph

48594 Commits

Author SHA1 Message Date
Mathieu Desnoyers
5b25b13ab0 sys_membarrier(): system-wide memory barrier (generic, x86)
Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads running on the system.  It is
implemented by calling synchronize_sched().  It can be used to
distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of
sys_membarrier() and a compiler barrier.  For synchronization primitives
that distinguish between read-side and write-side (e.g.  userspace RCU
[1], rwlocks), the read-side can be accelerated significantly by moving
the bulk of the memory barrier overhead to the write-side.

The existing applications of which I am aware that would be improved by
this system call are as follows:

* Through Userspace RCU library (http://urcu.so)
  - DNS server (Knot DNS) https://www.knot-dns.cz/
  - Network sniffer (http://netsniff-ng.org/)
  - Distributed object storage (https://sheepdog.github.io/sheepdog/)
  - User-space tracing (http://lttng.org)
  - Network storage system (https://www.gluster.org/)
  - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
  - Financial software (https://lkml.org/lkml/2015/3/23/189)

Those projects use RCU in userspace to increase read-side speed and
scalability compared to locking.  Especially in the case of RCU used by
libraries, sys_membarrier can speed up the read-side by moving the bulk of
the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
  - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect()
side-effect of issuing memory barriers through IPIs as a way to implement
Windows FlushProcessWriteBuffers() on Linux.  They are referring to
sys_membarrier in their github thread, specifically stating that
sys_membarrier() is what they are looking for.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu
rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses
with respect to smp_mb() present in Thread B, we can change each
smp_mb() within Thread A into calls to sys_membarrier() and each
smp_mb() within Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by synchronize_sched().

* Benchmarks

On Intel Xeon E5405 (8 cores)
(one thread is calling sys_membarrier, the other 7 threads are busy
looping)

1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 2202847 writes
signal-based scheme:          9830061167 reads,    6700 writes
sys_membarrier:               9952759104 reads,     425 writes
sys_membarrier (dyn. check):  7970328887 reads,     425 writes

The dynamic sys_membarrier availability check adds some overhead to
the read-side compared to the signal-based scheme, but besides that,
sys_membarrier slightly outperforms the signal-based scheme. However,
this non-expedited sys_membarrier implementation has a much slower grace
period than signal and memory barrier schemes.

Besides diminishing the number of wake-ups, one major advantage of the
membarrier system call over the signal-based scheme is that it does not
need to reserve a signal. This plays much more nicely with libraries,
and with processes injected into for tracing purposes, for which we
cannot expect that signals will be unused by the application.

An expedited version of this system call can be added later on to speed
up the grace period. Its implementation will likely depend on reading
the cpu_curr()->mm without holding each CPU's rq lock.

This patch adds the system call to x86 and to asm-generic.

[1] http://urcu.so

membarrier(2) man page:

MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)

NAME
       membarrier - issue memory barriers on a set of threads

SYNOPSIS
       #include <linux/membarrier.h>

       int membarrier(int cmd, int flags);

DESCRIPTION
       The cmd argument is one of the following:

       MEMBARRIER_CMD_QUERY
              Query  the  set  of  supported commands. It returns a bitmask of
              supported commands.

       MEMBARRIER_CMD_SHARED
              Execute a memory barrier on all threads running on  the  system.
              Upon  return from system call, the caller thread is ensured that
              all running threads have passed through a state where all memory
              accesses  to  user-space  addresses  match program order between
              entry to and return from the system  call  (non-running  threads
              are de facto in such a state). This covers threads from all pro=E2=80=90
              cesses running on the system.  This command returns 0.

       The flags argument needs to be 0. For future extensions.

       All memory accesses performed  in  program  order  from  each  targeted
       thread is guaranteed to be ordered with respect to sys_membarrier(). If
       we use the semantic "barrier()" to represent a compiler barrier forcing
       memory  accesses  to  be performed in program order across the barrier,
       and smp_mb() to represent explicit memory barriers forcing full  memory
       ordering  across  the barrier, we have the following ordering table for
       each pair of barrier(), sys_membarrier() and smp_mb():

       The pair ordering is detailed as (O: ordered, X: not ordered):

                              barrier()   smp_mb() sys_membarrier()
              barrier()          X           X            O
              smp_mb()           X           O            O
              sys_membarrier()   O           O            O

RETURN VALUE
       On success, these system calls return zero.  On error, -1 is  returned,
       and errno is set appropriately. For a given command, with flags
       argument set to 0, this system call is guaranteed to always return the
       same value until reboot.

ERRORS
       ENOSYS System call is not implemented.

       EINVAL Invalid arguments.

Linux                             2015-04-15                     MEMBARRIER(2)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nicholas Miell <nmiell@comcast.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-11 15:21:34 -07:00
Linus Torvalds
e013f74b60 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
 "There are a few fixes for snapshot behavior with CephFS and support
  for the new keepalive protocol from Zheng, a libceph fix that affects
  both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya,
  and several small fixes and cleanups from Jianpeng and others"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: improve readahead for file holes
  ceph: get inode size for each append write
  libceph: check data_len in ->alloc_msg()
  libceph: use keepalive2 to verify the mon session is alive
  rbd: plug rbd_dev->header.object_prefix memory leak
  rbd: fix double free on rbd_dev->header_name
  libceph: set 'exists' flag for newly up osd
  ceph: cleanup use of ceph_msg_get
  ceph: no need to get parent inode in ceph_open
  ceph: remove the useless judgement
  ceph: remove redundant test of head->safe and silence static analysis warnings
  ceph: fix queuing inode to mdsdir's snaprealm
  libceph: rename con_work() to ceph_con_workfn()
  libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
  libceph: remove the unused macro AES_KEY_SIZE
  ceph: invalidate dirty pages after forced umount
  ceph: EIO all operations after forced umount
2015-09-11 12:33:03 -07:00
Sagi Grimberg
7f39add3b0 block: Refuse request/bio merges with gaps in the integrity payload
If a driver sets the block queue virtual boundary mask, it means that
it cannot handle gaps so we must not allow those in the integrity
payload as well.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>

Fixed up by me to have duplicate integrity merge functions, depending
on whether block integrity is enabled or not. Fixes a compilations
issue with CONFIG_BLK_DEV_INTEGRITY unset.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-11 09:03:04 -06:00
Rafael J. Wysocki
7c976664d5 Merge branch 'pm-opp'
* pm-opp:
  PM / OPP: Return suspend_opp only if it is enabled
  PM / OPP: add dev_pm_opp_get_suspend_opp() helper
2015-09-11 15:37:17 +02:00
MyungJoo Ham
d54cdf3fc9 PM / devfreq: comments for get_dev_status usage updated
With the introduction of devfreq_update_stats(), governors
are not recommended to use get_dev_status() directly.

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
2015-09-11 14:23:29 +09:00
Javi Merino
08e75e754a PM / devfreq: cache the last call to get_dev_status()
The return value of get_dev_status() can be reused.  Cache it so that
other parts of the kernel can reuse it instead of having to call the
same function again.

Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
2015-09-11 14:23:28 +09:00
Linus Torvalds
b0a1ea51bd Merge branch 'for-4.3/blkcg' of git://git.kernel.dk/linux-block
Pull blk-cg updates from Jens Axboe:
 "A bit later in the cycle, but this has been in the block tree for a a
  while.  This is basically four patchsets from Tejun, that improve our
  buffered cgroup writeback.  It was dependent on the other cgroup
  changes, but they went in earlier in this cycle.

  Series 1 is set of 5 patches that has cgroup writeback updates:

   - bdi_writeback iteration fix which could lead to some wb's being
     skipped or repeated during e.g. sync under memory pressure.

   - Simplification of wb work wait mechanism.

   - Writeback tracepoints updated to report cgroup.

  Series 2 is is a set of updates for the CFQ cgroup writeback handling:

     cfq has always charged all async IOs to the root cgroup.  It didn't
     have much choice as writeback didn't know about cgroups and there
     was no way to tell who to blame for a given writeback IO.
     writeback finally grew support for cgroups and now tags each
     writeback IO with the appropriate cgroup to charge it against.

     This patchset updates cfq so that it follows the blkcg each bio is
     tagged with.  Async cfq_queues are now shared across cfq_group,
     which is per-cgroup, instead of per-request_queue cfq_data.  This
     makes all IOs follow the weight based IO resource distribution
     implemented by cfq.

     - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

     - Other misc review points addressed, acks added and rebased.

  Series 3 is the blkcg policy cleanup patches:

     This patchset contains assorted cleanups for blkcg_policy methods
     and blk[c]g_policy_data handling.

     - alloc/free added for blkg_policy_data.  exit dropped.

     - alloc/free added for blkcg_policy_data.

     - blk-throttle's async percpu allocation is replaced with direct
       allocation.

     - all methods now take blk[c]g_policy_data instead of blkcg_gq or
       blkcg.

  And finally, series 4 is a set of patches cleaning up the blkcg stats
  handling:

    blkcg's stats have always been somwhat of a mess.  This patchset
    tries to improve the situation a bit.

     - The following patches added to consolidate blkcg entry point and
       blkg creation.  This is in itself is an improvement and helps
       colllecting common stats on bio issue.

     - per-blkg stats now accounted on bio issue rather than request
       completion so that bio based and request based drivers can behave
       the same way.  The issue was spotted by Vivek.

     - cfq-iosched implements custom recursive stats and blk-throttle
       implements custom per-cpu stats.  This patchset make blkcg core
       support both by default.

     - cfq-iosched and blk-throttle keep track of the same stats
       multiple times.  Unify them"

* 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
  blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
  blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
  blkcg: implement interface for the unified hierarchy
  blkcg: misc preparations for unified hierarchy interface
  blkcg: separate out tg_conf_updated() from tg_set_conf()
  blkcg: move body parsing from blkg_conf_prep() to its callers
  blkcg: mark existing cftypes as legacy
  blkcg: rename subsystem name from blkio to io
  blkcg: refine error codes returned during blkcg configuration
  blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
  blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
  blkcg: remove cfqg_stats->sectors
  blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
  blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
  blkcg: make blkcg_[rw]stat per-cpu
  blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
  blkcg: consolidate blkg creation in blkcg_bio_issue_check()
  blk-throttle: improve queue bypass handling
  blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
  blkcg: inline [__]blkg_lookup()
  ...
2015-09-10 18:56:14 -07:00
Linus Torvalds
33e247c7e5 Merge branch 'akpm' (patches from Andrew)
Merge third patch-bomb from Andrew Morton:

 - even more of the rest of MM

 - lib/ updates

 - checkpatch updates

 - small changes to a few scruffy filesystems

 - kmod fixes/cleanups

 - kexec updates

 - a dma-mapping cleanup series from hch

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits)
  dma-mapping: consolidate dma_set_mask
  dma-mapping: consolidate dma_supported
  dma-mapping: cosolidate dma_mapping_error
  dma-mapping: consolidate dma_{alloc,free}_noncoherent
  dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}
  mm: use vma_is_anonymous() in create_huge_pmd() and wp_huge_pmd()
  mm: make sure all file VMAs have ->vm_ops set
  mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()
  mm: mark most vm_operations_struct const
  namei: fix warning while make xmldocs caused by namei.c
  ipc: convert invalid scenarios to use WARN_ON
  zlib_deflate/deftree: remove bi_reverse()
  lib/decompress_unlzma: Do a NULL check for pointer
  lib/decompressors: use real out buf size for gunzip with kernel
  fs/affs: make root lookup from blkdev logical size
  sysctl: fix int -> unsigned long assignments in INT_MIN case
  kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
  kexec: align crash_notes allocation to make it be inside one physical page
  kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
  kexec: split kexec_load syscall from kexec core code
  ...
2015-09-10 18:19:42 -07:00
Linus Torvalds
d71fc239b6 Merge tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull late ARM SoC updates from Kevin Hilman:
 "This is a collection of a few late fixes and other misc stuff that had
  dependencies on things being merged from other trees.

  The bulk of the changes are for samsung/exynos SoCs for some changes
  that needed a few minor reworks so ended up a bit late.  The others
  are mainly for qcom SoCs: a couple fixes and some DTS updates"

* tag 'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (37 commits)
  ARM: multi_v7_defconfig: Enable PBIAS regulator
  soc: qcom: smd: Correct fBLOCKREADINTR handling
  soc: qcom: smd: Use correct remote processor ID
  soc: qcom: smem: Fix errant private access
  ARM: dts: qcom: msm8974-sony-xperia-honami: Use stdout-path
  ARM: dts: qcom: msm8960-cdp: Use stdout-path
  ARM: dts: qcom: msm8660-surf: Use stdout-path
  ARM: dts: qcom: ipq8064-ap148: Use stdout-path
  ARM: dts: qcom: apq8084-mtp: Use stdout-path
  ARM: dts: qcom: apq8084-ifc6540: Use stdout-path
  ARM: dts: qcom: apq8074-dragonboard: Use stdout-path
  ARM: dts: qcom: apq8064-ifc6410: Use stdout-path
  ARM: dts: qcom: apq8064-cm-qs600: Use stdout-path
  ARM: dts: qcom: Label serial nodes for aliasing and stdout-path
  reset: ath79: Fix missing spin_lock_init
  reset: Add (devm_)reset_control_get stub functions
  ARM: EXYNOS: switch to using generic cpufreq driver for exynos4x12
  cpufreq: exynos: Remove unselectable rule for arm-exynos-cpufreq.o
  ARM: dts: add iommu property to JPEG device for exynos4
  ARM: dts: enable SPI1 for exynos4412-odroidu3
  ...
2015-09-10 17:59:04 -07:00
Linus Torvalds
519f526d39 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
 "ARM:
   - Full debug support for arm64
   - Active state switching for timer interrupts
   - Lazy FP/SIMD save/restore for arm64
   - Generic ARMv8 target

  PPC:
   - Book3S: A few bug fixes
   - Book3S: Allow micro-threading on POWER8

  x86:
   - Compiler warnings

  Generic:
   - Adaptive polling for guest halt"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
  kvm: irqchip: fix memory leak
  kvm: move new trace event outside #ifdef CONFIG_KVM_ASYNC_PF
  KVM: trace kvm_halt_poll_ns grow/shrink
  KVM: dynamic halt-polling
  KVM: make halt_poll_ns per-vCPU
  Silence compiler warning in arch/x86/kvm/emulate.c
  kvm: compile process_smi_save_seg_64() only for x86_64
  KVM: x86: avoid uninitialized variable warning
  KVM: PPC: Book3S: Fix typo in top comment about locking
  KVM: PPC: Book3S: Fix size of the PSPB register
  KVM: PPC: Book3S HV: Exit on H_DOORBELL if HOST_IPI is set
  KVM: PPC: Book3S HV: Fix race in starting secondary threads
  KVM: PPC: Book3S: correct width in XER handling
  KVM: PPC: Book3S HV: Fix preempted vcore stolen time calculation
  KVM: PPC: Book3S HV: Fix preempted vcore list locking
  KVM: PPC: Book3S HV: Implement H_CLEAR_REF and H_CLEAR_MOD
  KVM: PPC: Book3S HV: Fix bug in dirty page tracking
  KVM: PPC: Book3S HV: Fix race in reading change bit when removing HPTE
  KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8
  KVM: PPC: Book3S HV: Make use of unused threads when running guests
  ...
2015-09-10 16:42:49 -07:00
Linus Torvalds
65c61bc5db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix out-of-bounds array access in netfilter ipset, from Jozsef
    Kadlecsik.

 2) Use correct free operation on netfilter conntrack templates, from
    Daniel Borkmann.

 3) Fix route leak in SCTP, from Marcelo Ricardo Leitner.

 4) Fix sizeof(pointer) in mac80211, from Thierry Reding.

 5) Fix cache pointer comparison in ip6mr leading to missed unlock of
    mrt_lock.  From Richard Laing.

 6) rds_conn_lookup() needs to consider network namespace in key
    comparison, from Sowmini Varadhan.

 7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov
    Dmitriy.

 8) Fix fd leaks in bpf syscall, from Daniel Borkmann.

 9) Fix error recovery when installing ipv6 multipath routes, we would
    delete the old route before we would know if we could fully commit
    to the new set of nexthops.  Fix from Roopa Prabhu.

10) Fix run-time suspend problems in r8152, from Hayes Wang.

11) In fec, don't program the MAC address into the chip when the clocks
    are gated off.  From Fugang Duan.

12) Fix poll behavior for netlink sockets when using rx ring mmap, from
    Daniel Borkmann.

13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169
    driver, from Corinna Vinschen.

14) In TCP Cubic congestion control, handle idle periods better where we
    are application limited, in order to keep cwnd from growing out of
    control.  From Eric Dumzet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
  tcp_cubic: better follow cubic curve after idle period
  tcp: generate CA_EVENT_TX_START on data frames
  xen-netfront: respect user provided max_queues
  xen-netback: respect user provided max_queues
  r8169: Fix sleeping function called during get_stats64, v2
  ether: add IEEE 1722 ethertype - TSN
  netlink, mmap: fix edge-case leakages in nf queue zero-copy
  netlink, mmap: don't walk rx ring on poll if receive queue non-empty
  cxgb4: changes for new firmware 1.14.4.0
  net: fec: add netif status check before set mac address
  r8152: fix the runtime suspend issues
  r8152: split DRIVER_VERSION
  ipv6: fix ifnullfree.cocci warnings
  add microchip LAN88xx phy driver
  stmmac: fix check for phydev being open
  net: qlcnic: delete redundant memsets
  net: mv643xx_eth: use kzalloc
  net: jme: use kzalloc() instead of kmalloc+memset
  net: cavium: liquidio: use kzalloc in setup_glist()
  net: ipv6: use common fib_default_rule_pref
  ...
2015-09-10 13:53:15 -07:00
Oleg Nesterov
1fcfd8db7f mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()
Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
wrapper on top of do_mmap().  Perhaps we should update the callers of
do_mmap_pgoff() and kill it later.

This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and do not
play with vm internals.

After this change mmap_region() has a single user outside of mmap.c,
arch/tile/mm/elf.c:arch_setup_additional_pages().  It would be nice to
change arch/tile/ and unexport mmap_region().

[kirill@shutemov.name: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Dave Young
2965faa5e0 kexec: split kexec_load syscall from kexec core code
There are two kexec load syscalls, kexec_load another and kexec_file_load.
 kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Dave Young
a43cac0d9d kexec: split kexec_file syscall code to kexec_file.c
Split kexec_file syscall related code to another file kernel/kexec_file.c
so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.

Sharing variables and functions are moved to kernel/kexec_internal.h per
suggestion from Vivek and Petr.

[akpm@linux-foundation.org: fix bisectability]
[akpm@linux-foundation.org: declare the various arch_kexec functions]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Andy Shevchenko
37607102c4 seq_file: provide an analogue of print_hex_dump()
This introduces a new helper and switches current users to use it.  All
patches are compiled tested. kmemleak is tested via its own test suite.

This patch (of 6):

The new seq_hex_dump() is a complete analogue of print_hex_dump().

We have few users of this functionality already. It allows to reduce their
codebase.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Cc: Tadeusz Struk <tadeusz.struk@intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Tuchscherer <ingo.tuchscherer@de.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Frederic Weisbecker
90f023030e kmod: use system_unbound_wq instead of khelper
We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper.  This workqueue has
unbound properties and thus a wide affinity inherited by all its children.

Now khelper also has special properties that we aren't much interested in:
ordered and singlethread.  There is really no need about ordering as all
we do is creating kernel threads.  This can be done concurrently.  And
singlethread is a useless limitation as well.

The workqueue engine already proposes generic unbound workqueues that
don't share these useless properties and handle well parallel jobs.

The only worrysome specific is their affinity to the node of the current
CPU.  It's fine for creating the usermodehelper kernel threads but those
inherit this affinity for longer jobs such as requesting modules.

This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Kees Cook
b40bdb7fb2 lib/string_helpers: rename "esc" arg to "only"
To further clarify the purpose of the "esc" argument, rename it to "only"
to reflect that it is a limit, not a list of additional characters to
escape.

Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Linus Walleij
cdf17449af hexdump: do not print debug dumps for !CONFIG_DEBUG
print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or
dev_dbg() & friends.  Currently it will adhere to dynamic debug, but will
not stub out prints if CONFIG_DEBUG is not set.  Let's make it do the
right thing, because I am tired of having my dmesg buffer full of hex
dumps on production systems.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Jason A. Donenfeld
515a9adce0 include/linux/printk.h: include pr_fmt in pr_debug_ratelimited
The other two implementations of pr_debug_ratelimited include pr_fmt,
along with every other pr_* function.  But pr_debug_ratelimited forgot to
add it with the CONFIG_DYNAMIC_DEBUG implementation.

This patch unifies the behavior.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Vasily Kulikov
8b839635e7 include/linux/poison.h: remove not-used poison pointer macros
Signed-off-by: Vasily Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Vasily Kulikov
8a5e5e02fc include/linux/poison.h: fix LIST_POISON{1,2} offset
Poison pointer values should be small enough to find a room in
non-mmap'able/hardly-mmap'able space.  E.g.  on x86 "poison pointer space"
is located starting from 0x0.  Given unprivileged users cannot mmap
anything below mmap_min_addr, it should be safe to use poison pointers
lower than mmap_min_addr.

The current poison pointer values of LIST_POISON{1,2} might be too big for
mmap_min_addr values equal or less than 1 MB (common case, e.g.  Ubuntu
uses only 0x10000).  There is little point to use such a big value given
the "poison pointer space" below 1 MB is not yet exhausted.  Changing it
to a smaller value solves the problem for small mmap_min_addr setups.

The values are suggested by Solar Designer:
http://www.openwall.com/lists/oss-security/2015/05/02/6

Signed-off-by: Vasily Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Vladimir Davydov
33c3fc71c8 mm: introduce idle page tracking
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g.  by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.  However,
this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g.  by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.

The Young page flag is used to avoid interference with the memory
reclaimer.  A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.

Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Vladimir Davydov
1d7715c676 mmu-notifier: add clear_young callback
In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not only
in primary, but also in secondary ptes.  The latter is required in order
to estimate wss of KVM VMs.  At the same time we want to avoid flushing
tlb, because it is quite expensive and it won't really affect the final
result.

Currently, there is no function for clearing pte young bit that would meet
our requirements, so this patch introduces one.  To achieve that we have
to add a new mmu-notifier callback, clear_young, since there is no method
for testing-and-clearing a secondary pte w/o flushing tlb.  The new method
is not mandatory and currently only implemented by KVM.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Vladimir Davydov
e993d905c8 memcg: zap try_get_mem_cgroup_from_page
It is only used in mem_cgroup_try_charge, so fold it in and zap it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Vladimir Davydov
2fc0452470 memcg: add page_cgroup_ino helper
This patchset introduces a new user API for tracking user memory pages
that have not been used for a given period of time.  The purpose of this
is to provide the userspace with the means of tracking a workload's
working set, i.e.  the set of pages that are actively used by the
workload.  Knowing the working set size can be useful for partitioning the
system more efficiently, e.g.  by tuning memory cgroup limits
appropriately, or for job placement within a compute cluster.

==== USE CASES ====

The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size.  However, the working set size of a workload may be unknown or
change in time.  With this patch set, one can periodically estimate the
amount of memory unused by each cgroup and tune their memory.low and
memory.high parameters accordingly, therefore optimizing the overall
memory utilization.

Another use case is balancing workloads within a compute cluster.  Knowing
how much memory is not really used by a workload unit may help take a more
optimal decision when considering migrating the unit to another node
within the cluster.

Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle
pages only by smart user memory manager.

==== USER API ====

The user API consists of two new files:

 * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
   bit corresponds to a page, indexed by PFN. When the bit is set, the
   corresponding page is idle. A page is considered idle if it has not been
   accessed since it was marked idle. To mark a page idle one should set the
   bit corresponding to the page by writing to the file. A value written to the
   file is OR-ed with the current bitmap value. Only user memory pages can be
   marked idle, for other page types input is silently ignored. Writing to this
   file beyond max PFN results in the ENXIO error. Only available when
   CONFIG_IDLE_PAGE_TRACKING is set.

   This file can be used to estimate the amount of pages that are not
   used by a particular workload as follows:

   1. mark all pages of interest idle by setting corresponding bits in the
      /sys/kernel/mm/page_idle/bitmap
   2. wait until the workload accesses its working set
   3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set

 * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
   memory cgroup each page is charged to, indexed by PFN. Only available when
   CONFIG_MEMCG is set.

   This file can be used to find all pages (including unmapped file pages)
   accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
   can then estimate the cgroup working set size.

For an example of using these files for estimating the amount of unused
memory pages per each memory cgroup, please see the script attached
below.

==== REASONING ====

The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 6.

==== PATCHSET STRUCTURE ====

The patch set is organized as follows:

 - patch 1 adds page_cgroup_ino() helper for the sake of
   /proc/kpagecgroup and patches 2-3 do related cleanup
 - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
   charged to
 - patch 5 introduces a new mmu notifier callback, clear_young, which is
   a lightweight version of clear_flush_young; it is used in patch 6
 - patch 6 implements the idle page tracking feature, including the
   userspace API, /sys/kernel/mm/page_idle/bitmap
 - patch 7 exports idle flag via /proc/kpageflags

==== SIMILAR WORKS ====

Originally, the patch for tracking idle memory was proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
difference between Michel's patch and this one is that Michel implemented
a kernel space daemon for estimating idle memory size per cgroup while
this patch only provides the userspace with the minimal API for doing the
job, leaving the rest up to the userspace.  However, they both share the
same idea of Idle/Young page flags to avoid affecting the reclaimer logic.

==== PERFORMANCE EVALUATION ====

SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
performance impact introduced by this patch set.  Three runs were carried
out:

 - base: kernel without the patch
 - patched: patched kernel, the feature is not used
 - patched-active: patched kernel, 1 minute-period daemon is used for
   tracking idle memory

For tracking idle memory, idlememstat utility was used:
https://github.com/locker/idlememstat

testcase            base            patched        patched-active

compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%

composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%

time idlememstat:

17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
448inputs+40outputs (1major+36052minor)pagefaults 0swaps

==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
#! /usr/bin/python
#

import os
import stat
import errno
import struct

CGROUP_MOUNT = "/sys/fs/cgroup/memory"
BUFSIZE = 8 * 1024  # must be multiple of 8

def get_hugepage_size():
    with open("/proc/meminfo", "r") as f:
        for s in f:
            k, v = s.split(":")
            if k == "Hugepagesize":
                return int(v.split()[0]) * 1024

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
HUGEPAGE_SIZE = get_hugepage_size()

def set_idle():
    f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
    while True:
        try:
            f.write(struct.pack("Q", pow(2, 64) - 1))
        except IOError as err:
            if err.errno == errno.ENXIO:
                break
            raise
    f.close()

def count_idle():
    f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
    f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)

    with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
        while f.read(BUFSIZE): pass  # update idle flag

    idlememsz = {}
    while True:
        s1, s2 = f_flags.read(8), f_cgroup.read(8)
        if not s1 or not s2:
            break

        flags, = struct.unpack('Q', s1)
        cgino, = struct.unpack('Q', s2)

        unevictable = (flags >> 18) & 1
        huge = (flags >> 22) & 1
        idle = (flags >> 25) & 1

        if idle and not unevictable:
            idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                (HUGEPAGE_SIZE if huge else PAGE_SIZE)

    f_flags.close()
    f_cgroup.close()
    return idlememsz

if __name__ == "__main__":
    print "Setting the idle flag for each page..."
    set_idle()

    raw_input("Wait until the workload accesses its working set, "
              "then press Enter")

    print "Counting idle pages..."
    idlememsz = count_idle()

    for dir, subdirs, files in os.walk(CGROUP_MOUNT):
        ino = os.stat(dir)[stat.ST_INO]
        print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
==== END SCRIPT ====

This patch (of 8):

Add page_cgroup_ino() helper to memcg.

This function returns the inode number of the closest online ancestor of
the memory cgroup a page is charged to.  It is required for exporting
information about which page is charged to which cgroup to userspace,
which will be introduced by a following patch.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Dan Streetman
3f0e131221 zpool: add zpool_has_pool()
This series makes creation of the zpool and compressor dynamic, so that
they can be changed at runtime.  This makes using/configuring zswap
easier, as before this zswap had to be configured at boot time, using boot
params.

This uses a single list to track both the zpool and compressor together,
although Seth had mentioned an alternative which is to track the zpools
and compressors using separate lists.  In the most common case, only a
single zpool and single compressor, using one list is slightly simpler
than using two lists, and for the uncommon case of multiple zpools and/or
compressors, using one list is slightly less simple (and uses slightly
more memory, probably) than using two lists.

This patch (of 4):

Add zpool_has_pool() function, indicating if the specified type of zpool
is available (i.e.  zsmalloc or zbud).  This allows checking if a pool is
available, without actually trying to allocate it, similar to
crypto_has_alg().

This is used by a following patch to zswap that enables the dynamic
runtime creation of zswap zpools.

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Seth Jennings <sjennings@variantweb.net>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Chris Metcalf
30035e4575 string: provide strscpy()
The strscpy() API is intended to be used instead of strlcpy(),
and instead of most uses of strncpy().

- Unlike strlcpy(), it doesn't read from memory beyond (src + size).

- Unlike strlcpy() or strncpy(), the API provides an easy way to check
  for destination buffer overflow: an -E2BIG error return value.

- The provided implementation is robust in the face of the source
  buffer being asynchronously changed during the copy, unlike the
  current implementation of strlcpy().

- Unlike strncpy(), the destination buffer will be NUL-terminated
  if the string in the source buffer is too long.

- Also unlike strncpy(), the destination buffer will not be updated
  beyond the NUL termination, avoiding strncpy's behavior of zeroing
  the entire tail end of the destination buffer.  (A memset() after
  the strscpy() can be used if this behavior is desired.)

- The implementation should be reasonably performant on all
  platforms since it uses the asm/word-at-a-time.h API rather than
  simple byte copy.  Kernel-to-kernel string copy is not considered
  to be performance critical in any case.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
2015-09-10 15:36:59 -04:00
Daniel Borkmann
6bb0fef489 netlink, mmap: fix edge-case leakages in nf queue zero-copy
When netlink mmap on receive side is the consumer of nf queue data,
it can happen that in some edge cases, we write skb shared info into
the user space mmap buffer:

Assume a possible rx ring frame size of only 4096, and the network skb,
which is being zero-copied into the netlink skb, contains page frags
with an overall skb->len larger than the linear part of the netlink
skb.

skb_zerocopy(), which is generic and thus not aware of the fact that
shared info cannot be accessed for such skbs then tries to write and
fill frags, thus leaking kernel data/pointers and in some corner cases
possibly writing out of bounds of the mmap area (when filling the
last slot in the ring buffer this way).

I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
an advertised length larger than 4096, where the linear part is visible
at the slot beginning, and the leaked sizeof(struct skb_shared_info)
has been written to the beginning of the next slot (also corrupting
the struct nl_mmap_hdr slot header incl. status etc), since skb->end
points to skb->data + ring->frame_size - NL_MMAP_HDRLEN.

The fix adds and lets __netlink_alloc_skb() take the actual needed
linear room for the network skb + meta data into account. It's completely
irrelevant for non-mmaped netlink sockets, but in case mmap sockets
are used, it can be decided whether the available skb_tailroom() is
really large enough for the buffer, or whether it needs to internally
fallback to a normal alloc_skb().

>From nf queue side, the information whether the destination port is
an mmap RX ring is not really available without extra port-to-socket
lookup, thus it can only be determined in lower layers i.e. when
__netlink_alloc_skb() is called that checks internally for this. I
chose to add the extra ldiff parameter as mmap will then still work:
We have data_len and hlen in nfqnl_build_packet_message(), data_len
is the full length (capped at queue->copy_range) for skb_zerocopy()
and hlen some possible part of data_len that needs to be copied; the
rem_len variable indicates the needed remaining linear mmap space.

The only other workaround in nf queue internally would be after
allocation time by f.e. cap'ing the data_len to the skb_tailroom()
iff we deal with an mmap skb, but that would 1) expose the fact that
we use a mmap skb to upper layers, and 2) trim the skb where we
otherwise could just have moved the full skb into the normal receive
queue.

After the patch, in my test case the ring slot doesn't fit and therefore
shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
thus needs to be picked up via recv().

Fixes: 3ab1f683bf ("nfnetlink: add support for memory mapped netlink")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09 21:43:22 -07:00
Woojung.Huh@microchip.com
792aec47d5 add microchip LAN88xx phy driver
Add Microchip LAN88XX phy driver for phylib.

Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09 17:19:14 -07:00
Kevin Hilman
c6e59bdac9 Merge tag 'qcom-soc-for-4.3-rc2' of git://codeaurora.org/quic/kernel/agross-msm into next/late
Qualcomm ARM Based SoC Updates for 4.3-rc2

* Fix errant private access in SMEM
* Fix use of correct remote processor ID in SMD transactions
* Correct SMD fBLOCKREADINTR handling

* tag 'qcom-soc-for-4.3-rc2' of git://codeaurora.org/quic/kernel/agross-msm:
  soc: qcom: smd: Correct fBLOCKREADINTR handling
  soc: qcom: smd: Use correct remote processor ID
  soc: qcom: smem: Fix errant private access
  devicetree: soc: Add Qualcomm SMD based RPM DT binding
  soc: qcom: Driver for the Qualcomm RPM over SMD
  soc: qcom: Add Shared Memory Driver
  soc: qcom: Add device tree binding for Shared Memory Device
  drivers: qcom: Select QCOM_SCM unconditionally for QCOM_PM
  soc: qcom: Add Shared Memory Manager driver
2015-09-09 16:15:34 -07:00
Andy Gross
61e19ba99e Merge tag 'qcom-soc-for-4.3' into v4.2-rc2
Qualcomm ARM Based SoC Updates for 4.3

* Add SMEM driver
* Add SMD driver
* Add RPM over SMD driver
* Select QCOM_SCM by default
2015-09-09 15:56:35 -05:00
Linus Torvalds
b8889c4fc6 Merge tag 'tty-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty driver reverts from Greg KH:
 "Here are some reverts for some tty patches (specifically the pl011
  driver) that ended up breaking a bunch of machines (i.e. almost all
  of the ones with this chip).

  People are working on a fix for this, but in the meantime, it's best
  to just revert all 5 patches to restore people's serial consoles.

  These reverts have been in linux-next for many days now"

* tag 'tty-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "uart: pl011: Rename regs with enumeration"
  Revert "uart: pl011: Introduce register accessor"
  Revert "uart: pl011: Introduce register look up table"
  Revert "uart: pl011: Improve LCRH register access decision"
  Revert "uart: pl011: Add support to ZTE ZX296702 uart"
2015-09-09 11:27:01 -07:00
Linus Torvalds
82278fc079 Merge tag 'pwm/for-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
Pull pwm updates from Thierry Reding:
 "This set of changes introduces the beginnings of a new API that's
  based around the concept of states that can be atomically applied.
  Drivers go to various lengths to implement something similar, which
  indicates that the core should really be providing the necessary
  framework.

  On top of that, there is a bit of cleanup as well as improved
  kerneldoc and integration into the device-drivers DocBook.

  Regarding drivers there is a new one for the NXP LPC18xx family of
  SoCs and a couple of fixes for existing drivers (pca9685, Broadcom
  Kona and Atmel HLCDC)"

* tag 'pwm/for-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
  ARM: at91: pwm: atmel-hlcdc: Add at91sam9n12 errata
  pwm: Add NXP LPC18xx PWM/SCT DT binding documentation
  pwm: NXP LPC18xx PWM/SCT driver
  pwm-pca9685: Support changing the output frequency
  pwm-pca9685: Fix several driver bugs
  pwm: kona: Modify settings application sequence
  pwm: pca9685: Drop owner assignment
  pwm: Add to device-drivers documentation
  pwm: Clean up kerneldoc
  pwm: Remove useless whitespace
  pwm: sysfs: Remove unnecessary padding
  pwm: sysfs: Properly convert from enum to string
  pwm: Make use of pwm_get_xxx() helpers where appropriate
  pwm: Add pwm_get_polarity() helper function
  pwm: Constify PWM device where possible
  pwm: Add the pwm_is_enabled() helper
2015-09-09 10:55:32 -07:00
Linus Torvalds
26d2177e97 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull inifiniband/rdma updates from Doug Ledford:
 "This is a fairly sizeable set of changes.  I've put them through a
  decent amount of testing prior to sending the pull request due to
  that.

  There are still a few fixups that I know are coming, but I wanted to
  go ahead and get the big, sizable chunk into your hands sooner rather
  than waiting for those last few fixups.

  Of note is the fact that this creates what is intended to be a
  temporary area in the drivers/staging tree specifically for some
  cleanups and additions that are coming for the RDMA stack.  We
  deprecated two drivers (ipath and amso1100) and are waiting to hear
  back if we can deprecate another one (ehca).  We also put Intel's new
  hfi1 driver into this area because it needs to be refactored and a
  transfer library created out of the factored out code, and then it and
  the qib driver and the soft-roce driver should all be modified to use
  that library.

  I expect drivers/staging/rdma to be around for three or four kernel
  releases and then to go away as all of the work is completed and final
  deletions of deprecated drivers are done.

  Summary of changes for 4.3:

   - Create drivers/staging/rdma
   - Move amso1100 driver to staging/rdma and schedule for deletion
   - Move ipath driver to staging/rdma and schedule for deletion
   - Add hfi1 driver to staging/rdma and set TODO for move to regular
     tree
   - Initial support for namespaces to be used on RDMA devices
   - Add RoCE GID table handling to the RDMA core caching code
   - Infrastructure to support handling of devices with differing read
     and write scatter gather capabilities
   - Various iSER updates
   - Kill off unsafe usage of global mr registrations
   - Update SRP driver
   - Misc  mlx4 driver updates
   - Support for the mr_alloc verb
   - Support for a netlink interface between kernel and user space cache
     daemon to speed path record queries and route resolution
   - Ininitial support for safe hot removal of verbs devices"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
  IB/ipoib: Suppress warning for send only join failures
  IB/ipoib: Clean up send-only multicast joins
  IB/srp: Fix possible protection fault
  IB/core: Move SM class defines from ib_mad.h to ib_smi.h
  IB/core: Remove unnecessary defines from ib_mad.h
  IB/hfi1: Add PSM2 user space header to header_install
  IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
  mlx5: Fix incorrect wc pkey_index assignment for GSI messages
  IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
  IB/uverbs: reject invalid or unknown opcodes
  IB/cxgb4: Fix if statement in pick_local_ip6adddrs
  IB/sa: Fix rdma netlink message flags
  IB/ucma: HW Device hot-removal support
  IB/mlx4_ib: Disassociate support
  IB/uverbs: Enable device removal when there are active user space applications
  IB/uverbs: Explicitly pass ib_dev to uverbs commands
  IB/uverbs: Fix race between ib_uverbs_open and remove_one
  IB/uverbs: Fix reference counting usage of event files
  IB/core: Make ib_dealloc_pd return void
  IB/srp: Create an insecure all physical rkey only if needed
  ...
2015-09-09 08:33:31 -07:00
Linus Torvalds
a794b4f329 Merge tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi
Pull IPMI updates from Corey Minyard:
 "Most of these have been sitting in linux-next for more than a release,
  particularly commit 0fbcf4af7c ("ipmi: Convert the IPMI SI ACPI
  handling to a platform device") which is probably the most complex
  patch.

  That is also the one that changes drivers/acpi/acpi_pnp.c.  The change
  in that file is only removing IPMI from a "special platform devices"
  list, since I convert it to the standard PNP interface.  I posted this
  one to the ACPI list twice and got no response, and it seems to work
  well in my testing, so I'm hoping it's good.

  Hidehiro Kawai posted a set of changes that improves the panic time
  handling in the IPMI driver.

  The rest of the changes are minor bug fixes or cleanups and some
  documentation"

* tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi:
  ipmi:ssif: Add a module parm to specify that SMBus alerts don't work
  ipmi: add of_device_id in MODULE_DEVICE_TABLE
  ipmi: Compensate for BMCs that wont set the irq enable bit
  ipmi: Don't call receive handler in the panic context
  ipmi: Avoid touching possible corrupted lists in the panic context
  ipmi: Don't flush messages in sender() in run-to-completion mode
  ipmi: Factor out message flushing procedure
  ipmi: Remove unneeded set_run_to_completion call
  ipmi: Make some data const that was only read
  ipmi: constify SSIF ACPI device ids
  ipmi: Delete an unnecessary check before the function call "cleanup_one_si"
  char:ipmi - Change 1 to true for bool type variables during initialization.
  impi:Remove unneeded setting of module owner to THIS_MODULE in the platform structure, powernv_ipmi_driver
  ipmi: Add a comment in how messages are delivered from the lower layer
  ipmi/powernv: Fix potential invalid pointer dereference
  ipmi: Convert the IPMI SI ACPI handling to a platform device
  ipmi: Add device tree bindings information
2015-09-08 18:19:17 -07:00
Linus Torvalds
f6f7a63692 Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:
 "Almost all of the rest of MM.  There was an unusually large amount of
  MM material this time"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
  zpool: remove no-op module init/exit
  mm: zbud: constify the zbud_ops
  mm: zpool: constify the zpool_ops
  mm: swap: zswap: maybe_preload & refactoring
  zram: unify error reporting
  zsmalloc: remove null check from destroy_handle_cache()
  zsmalloc: do not take class lock in zs_shrinker_count()
  zsmalloc: use class->pages_per_zspage
  zsmalloc: consider ZS_ALMOST_FULL as migrate source
  zsmalloc: partial page ordering within a fullness_list
  zsmalloc: use shrinker to trigger auto-compaction
  zsmalloc: account the number of compacted pages
  zsmalloc/zram: introduce zs_pool_stats api
  zsmalloc: cosmetic compaction code adjustments
  zsmalloc: introduce zs_can_compact() function
  zsmalloc: always keep per-class stats
  zsmalloc: drop unused variable `nr_to_migrate'
  mm/memblock.c: fix comment in __next_mem_range()
  mm/page_alloc.c: fix type information of memoryless node
  memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
  ...
2015-09-08 17:52:23 -07:00
Linus Torvalds
9a9952bbd7 Merge tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates for from Joerg Roedel:
 "This time the IOMMU updates are mostly cleanups or fixes.  No big new
  features or drivers this time.  In particular the changes include:

   - Bigger cleanup of the Domain<->IOMMU data structures and the code
     that manages them in the Intel VT-d driver.  This makes the code
     easier to understand and maintain, and also easier to keep the data
     structures in sync.  It is also a preparation step to make use of
     default domains from the IOMMU core in the Intel VT-d driver.

   - Fixes for a couple of DMA-API misuses in ARM IOMMU drivers, namely
     in the ARM and Tegra SMMU drivers.

   - Fix for a potential buffer overflow in the OMAP iommu driver's
     debug code

   - A couple of smaller fixes and cleanups in various drivers

   - One small new feature: Report domain-id usage in the Intel VT-d
     driver to easier detect bugs where these are leaked"

* tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (83 commits)
  iommu/vt-d: Really use upper context table when necessary
  x86/vt-d: Fix documentation of DRHD
  iommu/fsl: Really fix init section(s) content
  iommu/io-pgtable-arm: Unmap and free table when overwriting with block
  iommu/io-pgtable-arm: Move init-fn declarations to io-pgtable.h
  iommu/msm: Use BUG_ON instead of if () BUG()
  iommu/vt-d: Access iomem correctly
  iommu/vt-d: Make two functions static
  iommu/vt-d: Use BUG_ON instead of if () BUG()
  iommu/vt-d: Return false instead of 0 in irq_remapping_cap()
  iommu/amd: Use BUG_ON instead of if () BUG()
  iommu/amd: Make a symbol static
  iommu/amd: Simplify allocation in irq_remapping_alloc()
  iommu/tegra-smmu: Parameterize number of TLB lines
  iommu/tegra-smmu: Factor out tegra_smmu_set_pde()
  iommu/tegra-smmu: Extract tegra_smmu_pte_get_use()
  iommu/tegra-smmu: Use __GFP_ZERO to allocate zeroed pages
  iommu/tegra-smmu: Remove PageReserved manipulation
  iommu/tegra-smmu: Convert to use DMA API
  iommu/tegra-smmu: smmu_flush_ptc() wants device addresses
  ...
2015-09-08 17:22:35 -07:00
Bartlomiej Zolnierkiewicz
4eafbd15b6 PM / OPP: add dev_pm_opp_get_suspend_opp() helper
Add dev_pm_opp_get_suspend_opp() helper to obtain suspend opp.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-09-09 02:20:39 +02:00
Linus Torvalds
e81b594cda Merge tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap updates from Mark Brown:
 "This has been a busy release for regmap.

  By far the biggest set of changes here are those from Markus Pargmann
  which implement support for block transfers in smbus devices.  This
  required quite a bit of refactoring but leaves us better able to
  handle odd restrictions that controllers may have and with better
  performance on smbus.

  Other new features include:

   - Fix interactions with lockdep for nested regmaps (eg, when a device
     using regmap is connected to a bus where the bus controller has a
     separate regmap).  Lockdep's default class identification is too
     crude to work without help.

   - Support for must write bitfield operations, useful for operations
     which require writing a bit to trigger them from Kuniori Morimoto.

   - Support for delaying during register patch application from Nariman
     Poushin.

   - Support for overriding cache state via the debugfs implementation
     from Richard Fitzgerald"

* tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits)
  regmap: fix a NULL pointer dereference in __regmap_init
  regmap: Support bulk reads for devices without raw formatting
  regmap-i2c: Add smbus i2c block support
  regmap: Add raw_write/read checks for max_raw_write/read sizes
  regmap: regmap max_raw_read/write getter functions
  regmap: Introduce max_raw_read/write for regmap_bulk_read/write
  regmap: Add missing comments about struct regmap_bus
  regmap: No multi_write support if bus->write does not exist
  regmap: Split use_single_rw internally into use_single_read/write
  regmap: Fix regmap_bulk_write for bus writes
  regmap: regmap_raw_read return error on !bus->read
  regulator: core: Print at debug level on debugfs creation failure
  regmap: Fix regmap_can_raw_write check
  regmap: fix typos in regmap.c
  regmap: Fix integertypes for register address and value
  regmap: Move documentation to regmap.h
  regmap: Use different lockdep class for each regmap init call
  thermal: sti: Add parentheses around bridge->ops->regmap_init call
  mfd: vexpress: Add parentheses around bridge->ops->regmap_init call
  regmap: debugfs: Fix misuse of IS_ENABLED
  ...
2015-09-08 16:48:55 -07:00
Linus Torvalds
fa815580fb Merge tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull fbdev updates from Tomi Valkeinen:
 "Minor fixes and cleanups"

* tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
  video: fbdev: atmel_lcdfb: remove useless include
  video: fbdev: pxa168fb: Use devm_clk_get
  fbdev: ssd1307fb: fix error return code
  fbdev: fix snprintf() limit in show_bl_curve()
  video: fbdev: s3c-fb: Constify platform_device_id
  video: fbdev: atmel: fix warning for const return value
  video: fbdev: Drop owner assignment from platform_driver
  video: fbdev: Drop owner assignment from i2c_driver
  fbdev: remove unnecessary memset in vfb
  framebuffer: disable vgacon on microblaze arch
  fbdev: udlfb: remove unneeded initialization in few places
  fbdev: Allow compile test of GPIO consumers if !GPIOLIB
  fbdev: fix cea_modes array size
2015-09-08 16:42:55 -07:00
Linus Torvalds
85579ad7f1 Merge tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC updates from Ulf Hansson:
 "MMC core:
   - Fix a race condition in the request handling
   - Skip trim commands for some buggy kingston eMMCs
   - An optimization and a correction for erase groups
   - Set CMD23 quirk for some Sandisk cards

  MMC host:
   - sdhci: Give GPIO CD higher precedence and don't poll when it's used
   - sdhci: Fix DMA memory leakage
   - sdhci: Some updates for clock management
   - sdhci-of-at91: introduce driver for the Atmel SDMMC
   - sdhci-of-arasan: Add support for sdhci-5.1
   - sdhci-esdhc-imx: Add support for imx7d which also supports HS400
   - sdhci: A collection of fixes and improvements for various sdhci hosts
   - omap_hsmmc: Modernization of the regulator code
   - dw_mmc: A couple of fixes for DMA and PIO mode
   - usdhi6rol0: A few fixes and support probe deferral for regulators
   - pxamci: Convert to use dmaengine
   - sh_mmcif: Fix the suspend process in a short term solution
   - tmio: Adjust timeout for commands
   - sunxi: Fix timeout while gating/ungating clock"

* tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc: (67 commits)
  mmc: android-goldfish: remove incorrect __iomem annotation
  mmc: core: fix race condition in mmc_wait_data_done
  mmc: host: omap_hsmmc: remove CONFIG_REGULATOR check
  mmc: host: omap_hsmmc: use ios->vdd for setting vmmc voltage
  mmc: host: omap_hsmmc: use regulator_is_enabled to find pbias status
  mmc: host: omap_hsmmc: enable/disable vmmc_aux regulator based on previous state
  mmc: host: omap_hsmmc: don't use ->set_power to set initial regulator state
  mmc: host: omap_hsmmc: avoid pbias regulator enable on power off
  mmc: host: omap_hsmmc: add separate function to set pbias
  mmc: host: omap_hsmmc: add separate functions for enable/disable supply
  mmc: host: omap_hsmmc: return error if any of the regulator APIs fail
  mmc: host: omap_hsmmc: remove unnecessary pbias set_voltage
  mmc: host: omap_hsmmc: use mmc_host's vmmc and vqmmc
  mmc: host: omap_hsmmc: use the ocrmask provided by the vmmc regulator
  mmc: host: omap_hsmmc: cleanup omap_hsmmc_reg_get()
  mmc: host: omap_hsmmc: return on fatal errors from omap_hsmmc_reg_get
  mmc: host: omap_hsmmc: use devm_regulator_get_optional() for vmmc
  mmc: sdhci-of-at91: fix platform_no_drv_owner.cocci warnings
  mmc: sh_mmcif: Fix suspend process
  mmc: usdhi6rol0: fix error return code
  ...
2015-09-08 16:33:16 -07:00
Linus Torvalds
acceba598e Merge branch 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
 "Features:

   - new drivers: Renesas EMEV2, register based MUX, NXP LPC2xxx
   - core: scans DT and assigns wakeup interrupts.  no driver changes needed.
   - core: some refcouting issues fixed and better API for that
   - core: new helper function for best effort block read emulation
   - slave framework: proper DT bindings and userspace instantiation
   - some bigger work for xiic, pxa, omap drivers

  .. and quite a number of smaller driver fixes, cleanups, improvements"

* 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (65 commits)
  i2c: mux: reg Change ioread endianness for readback
  i2c: mux: reg: fix compilation warnings
  i2c: mux: reg: simplify register size checking
  i2c: muxes: fix leaked i2c adapter device node references
  i2c: allow specifying separate wakeup interrupt in device tree
  of/irq: export of_get_irq_byname()
  i2c: xgene-slimpro: dma_mapping_error() doesn't return an error code
  i2c: Replace I2C_CROS_EC_TUNNEL dependency
  eeprom: at24: use i2c_smbus_read_i2c_block_data_or_emulated
  i2c: core: Add support for best effort block read emulation
  i2c: lpc2k: add driver
  i2c: mux: Add register-based mux i2c-mux-reg
  i2c: dt: describe generic bindings
  i2c: slave: print warning if slave flag not set
  i2c: support 10 bit and slave addresses in sysfs 'new_device'
  i2c: take address space into account when checking for used addresses
  i2c: apply DT flags when probing
  i2c: make address check indpendent from client struct
  i2c: rename address check functions
  i2c: apply address offset for slaves, too
  ...
2015-09-08 16:16:26 -07:00
Krzysztof Kozlowski
c83db4f419 mm: zbud: constify the zbud_ops
The structure zbud_ops is not modified so make the pointer to it a
pointer to const.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Krzysztof Kozlowski
786727799a mm: zpool: constify the zpool_ops
The structure zpool_ops is not modified so make the pointer to it a
pointer to const.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Dmitry Safonov
5b999aadba mm: swap: zswap: maybe_preload & refactoring
zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
same code with only significant difference in return value and usage of
swap_readpage.

I a helper __read_swap_cache_async() with the common code.  Behavior
change: now zswap_get_swap_cache_page will use radix_tree_maybe_preload
instead radix_tree_preload.  Looks like, this wasn't changed only by the
reason of code duplication.

Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Sergey Senozhatsky
860c707dca zsmalloc: account the number of compacted pages
Compaction returns back to zram the number of migrated objects, which is
quite uninformative -- we have objects of different sizes so user space
cannot obtain any valuable data from that number.  Change compaction to
operate in terms of pages and return back to compaction issuer the
number of pages that were freed during compaction.  So from now on we
will export more meaningful value in zram<id>/mm_stat -- the number of
freed (compacted) pages.

This requires:
 (a) a rename of `num_migrated' to 'pages_compacted'
 (b) a internal API change -- return first_page's fullness_group from
     putback_zspage(), so we know when putback_zspage() did
     free_zspage().  It helps us to account compaction stats correctly.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Sergey Senozhatsky
7d3f393823 zsmalloc/zram: introduce zs_pool_stats api
`zs_compact_control' accounts the number of migrated objects but it has
a limited lifespan -- we lose it as soon as zs_compaction() returns back
to zram.  It worked fine, because (a) zram had it's own counter of
migrated objects and (b) only zram could trigger compaction.  However,
this does not work for automatic pool compaction (not issued by zram).
To account objects migrated during auto-compaction (issued by the
shrinker) we need to store this number in zs_pool.

Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
there.  It provides only `num_migrated', as of this writing, but it
surely can be extended.

A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
caller.

Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Vlastimil Babka
82c1fc7147 mm: use numa_mem_id() in alloc_pages_node()
alloc_pages_node() might fail when called with NUMA_NO_NODE and
__GFP_THISNODE on a CPU belonging to a memoryless node.  To make the
local-node fallback more robust and prevent such situations, use
numa_mem_id(), which was introduced for similar scenarios in the slab
context.

Suggested-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Vlastimil Babka
0bc35a970c mm: unify checks in alloc_pages_node() and __alloc_pages_node()
Perform the same debug checks in alloc_pages_node() as are done in
__alloc_pages_node(), by making the former function a wrapper of the
latter one.

In addition to better diagnostics in DEBUG_VM builds for situations
which have been already fatal (e.g.  out-of-bounds node id), there are
two visible changes for potential existing buggy callers of
alloc_pages_node():

- calling alloc_pages_node() with any negative nid (e.g. due to arithmetic
  overflow) was treated as passing NUMA_NO_NODE and fallback to local node was
  applied. This will now be fatal.
- calling alloc_pages_node() with an offline node will now be checked for
  DEBUG_VM builds. Since it's not fatal if the node has been previously online,
  and this patch may expose some existing buggy callers, change the VM_BUG_ON
  in __alloc_pages_node() to VM_WARN_ON.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Vlastimil Babka
96db800f5d mm: rename alloc_pages_exact_node() to __alloc_pages_node()
alloc_pages_exact_node() was introduced in commit 6484eb3e2a ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise.  In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.

The misleading name has lead to mistakes in the past, see for example
commits 5265047ac3 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f ("mm, mempolicy:
migrate_to_node should only migrate to node").

Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.

To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage.  Both functions get described in comments.

It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly.  The number of users would be small
anyway.

Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead.  This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.

Both differences will be rectified by the next patch.

To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers.  Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00