Commit Graph

691898 Commits

Author SHA1 Message Date
Liam R. Howlett
d715cf804a mm/hugetlb.c: warn the user when issues arise on boot due to hugepages
When the user specifies too many hugepages or an invalid
default_hugepagesz the communication to the user is implicit in the
allocation message.  This patch adds a warning when the desired page
count is not allocated and prints an error when the default_hugepagesz
is invalid on boot.

During boot hugepages will allocate until there is a fraction of the
hugepage size left.  That is, we allocate until either the request is
satisfied or memory for the pages is exhausted.  When memory for the
pages is exhausted, it will most likely lead to the system failing with
the OOM manager not finding enough (or anything) to kill (unless you're
using really big hugepages in the order of 100s of MB or in the GBs).
The user will most likely see the OOM messages much later in the boot
sequence than the implicitly stated message.  Worse yet, you may even
get an OOM for each processor which causes many pages of OOMs on modern
systems.  Although these messages will be printed earlier than the OOM
messages, at least giving the user errors and warnings will highlight
the configuration as an issue.  I'm trying to point the user in the
right direction by providing a more robust statement of what is failing.

During the sysctl or echo command, the user can check the results much
easier than if the system hangs during boot and the scenario of having
nothing to OOM for kernel memory is highly unlikely.

Mike said:
 "Before sending out this patch, I asked Liam off list why he was doing
  it. Was it something he just thought would be useful? Or, was there
  some type of user situation/need. He said that he had been called in
  to assist on several occasions when a system OOMed during boot. In
  almost all of these situations, the user had grossly misconfigured
  huge pages.

  DB users want to pre-allocate just the right amount of huge pages, but
  sometimes they can be really off. In such situations, the huge page
  init code just allocates as many huge pages as it can and reports the
  number allocated. There is no indication that it quit allocating
  because it ran out of memory. Of course, a user could compare the
  number in the message to what they requested on the command line to
  determine if they got all the huge pages they requested. The thought
  was that it would be useful to at least flag this situation. That way,
  the user might be able to better relate the huge page allocation
  failure to the OOM.

  I'm not sure if the e-mail discussion made it obvious that this is
  something he has seen on several occasions.

  I see Michal's point that this will only flag the situation where
  someone configures huge pages very badly. And, a more extensive look
  at the situation of misconfiguring huge pages might be in order. But,
  this has happened on several occasions which led to the creation of
  this patch"

[akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: zhongjiang <zhongjiang@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Anshuman Khandual
e35ef6397b mm/cma.c: warn if the CMA area could not be activated
While activating a CMA area we check to make sure that all the PFNs in
the range are inside the same zone.  This is a requirement for
alloc_contig_range() to work.  Any CMA area failing the check is
disabled for good.  This happens silently right now making all future
cma_alloc() allocations failure inevitable.

Here we add an error message stating that the CMA area could not be
activated which makes it easier to explain any future cma_alloc()
failures on it.  While in there, change the bail out goto label from
'err' to 'not_in_zone' which makes more sense.

Link: http://lkml.kernel.org/r/20170605023729.26303-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Yisheng Xie
78c72746f5 vmalloc: show lazy-purged vma info in vmallocinfo
When ioremap a 67112960 bytes vm_area with the vmallocinfo:
 [..]
 0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap

we get the result:
 0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap

For the align for ioremap must be less than '1 << IOREMAP_MAX_ORDER':

	if (flags & VM_IOREMAP)
		align = 1ul << clamp_t(int, get_count_order_long(size),
			PAGE_SHIFT, IOREMAP_MAX_ORDER);

So it makes idiot like me a litte puzzled why this was a jump the
vm_area from 0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, and leaving
0xed000000-0xf1000000 as a big hole.

This patch is to show all of vm_area, including vmas which are freeing
but still in the vmap_area_list, to make it more clear about why we will
get 0xf1000000-0xf5001000 in the above case.  And we will get a
vmallocinfo like:

 [..]
 0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap
 [..]
 0xece7c000-0xece7e000    8192 unpurged vm_area
 0xece7e000-0xece83000   20480 vm_map_ram
 0xf0099000-0xf00aa000   69632 vm_map_ram

after this patch.

Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: zijun_hu <zijun_hu@htc.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Sean Christopherson
34c8105792 mm/memcontrol: exclude @root from checks in mem_cgroup_low
Make @root exclusive in mem_cgroup_low; it is never considered low when
looked at directly and is not checked when traversing the tree.  In
effect, @root is handled identically to how root_mem_cgroup was
previously handled by mem_cgroup_low.

If @root is not excluded from the checks, a cgroup underneath @root will
never be considered low during targeted reclaim of @root, e.g.  due to
memory.current > memory.high, unless @root is misconfigured to have
memory.low > memory.high.

Excluding @root enables using memory.low to prioritize memory usage
between cgroups within a subtree of the hierarchy that is limited by
memory.high or memory.max, e.g.  when ROOT owns @root's controls but
delegates the @root directory to a USER so that USER can create and
administer children of @root.

For example, given cgroup A with children B and C:

    A
   / \
  B   C

and

  1. A/memory.current > A/memory.high
  2. A/B/memory.current < A/B/memory.low
  3. A/C/memory.current >= A/C/memory.low

As 'A' is high, i.e.  triggers reclaim from 'A', and 'B' is low, we
should reclaim from 'C' until 'A' is no longer high or until we can no
longer reclaim from 'C'.  If 'A', i.e.  @root, isn't excluded by
mem_cgroup_low when reclaming from 'A', then 'B' won't be considered low
and we will reclaim indiscriminately from both 'B' and 'C'.

Here is the test I used to confirm the bug and the patch.

20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
#!/bin/bash

x62mb=$((62<<20))
x66mb=$((66<<20))
x94mb=$((94<<20))
x98mb=$((98<<20))

setup() {
    set -e

    if [[ -n $DEBUG ]]; then
        set -x
    fi

    trap teardown EXIT HUP INT TERM

    if [[ ! -e /mnt/1gb.swap ]]; then
        sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
        sudo mkswap /mnt/1gb.swap > /dev/null
    fi
    if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
        sudo swapon /mnt/1gb.swap
    fi

    if [[ ! -e /cgroup/cgroup.controllers ]]; then
        sudo mount -t cgroup2 none /cgroup
    fi

    grep -q memory /cgroup/cgroup.controllers

    sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"

    sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
    sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
    sudo sh -c "echo '96m' > /cgroup/A/memory.high"

    mkdir /cgroup/A/0
    mkdir /cgroup/A/1

    echo 64m > /cgroup/A/0/memory.low
}

teardown() {
    set +e

    trap - EXIT HUP INT TERM

    if [[ -z $1 ]]; then
        printf "\n"
        printf "%0.s*" {1..35}
        printf "\nFAILED!\n\n"
        tail /cgroup/A/**/memory.current
        printf "%0.s*" {1..35}
        printf "\n\n"
    fi

    ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %

    sleep 2

    if [[ -e /cgroup/A/0 ]]; then
        rmdir /cgroup/A/0
    fi
    if [[ -e /cgroup/A/1 ]]; then
        rmdir /cgroup/A/1
    fi
    if [[ -e /cgroup/A ]]; then
        sudo rmdir /cgroup/A
    fi
}

stress_test() {
    sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
    stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &

    sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
    stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &

    sudo sh -c "echo $$ > /cgroup/cgroup.procs"

    sleep 1

    # A/0 should be consuming more memory than A/1
    [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]

    # A/0 should be consuming ~64mb
    [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]

    # A should cumulatively be consuming ~96mb
    [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]

    # Stop the stressors
    ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
}

teardown 1
setup

for ((i=1;i<=$1;i++)); do
    printf "ITERATION $i of $1 - stress_test 0 1"
    stress_test 0 1
    printf "\x1b[2K\r"

    printf "ITERATION $i of $1 - stress_test 1 0"
    stress_test 1 0
    printf "\x1b[2K\r"

    printf "ITERATION $i of $1 - PASSED\n"
done

teardown 1

echo PASSED!

20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10

Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Michal Hocko
1860033237 mm: make PR_SET_THP_DISABLE immediately active
PR_SET_THP_DISABLE has a rather subtle semantic.  It doesn't affect any
existing mapping because it only updated mm->def_flags which is a
template for new mappings.

The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
flag set.  This can be quite surprising for all those applications which
do not do prctl(); fork() & exec() and want to control their own THP
behavior.

Another usecase when the immediate semantic of the prctl might be useful
is a combination of pre- and post-copy migration of containers with
CRIU.  In this case CRIU populates a part of a memory region with data
that was saved during the pre-copy stage.  Afterwards, the region is
registered with userfaultfd and CRIU expects to get page faults for the
parts of the region that were not yet populated.  However, khugepaged
collapses the pages and the expected page faults do not occur.

In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.

Implementation wise, a new MMF_DISABLE_THP flag is added.  This flag is
tested when decision whether to use huge pages is taken either during
page fault of at the time of THP collapse.

It should be noted, that the new implementation makes PR_SET_THP_DISABLE
master override to any per-VMA setting, which was not the case
previously.

Fixes: a0715cc226 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
David Rientjes
b6bb981149 mm, vmpressure: pass-through notification support
By default, vmpressure events are not pass-through, i.e.  they propagate
up through the memcg hierarchy until an event notifier is found for any
threshold level.

This presents a difficulty when a thread waiting on a read(2) for a
vmpressure event cannot distinguish between local memory pressure and
memory pressure in a descendant memcg, especially when that thread may
not control the memcg hierarchy.

Consider a user-controlled child memcg with a smaller limit than a
top-level memcg controlled by the "Activity Manager" specified in
Documentation/cgroup-v1/memory.txt.  It may register for memory pressure
notification for descendant memcgs to make a policy decision: oom kill a
low priority job, increase the limit, decrease other limits, etc.  If it
registers for memory pressure notification on the top-level memcg, it
currently cannot distinguish between memory pressure in its own memcg or
a descendant memcg, which is user-controlled.

Conversely, if a user registers for memory pressure notification on
their own descendant memcg, the Activity Manager does not receive any
pressure notification for that child memcg hierarchy.  Vmpressure events
are not received for ancestor memcgs if the memcg experiencing pressure
have notifiers registered, perhaps outside the knowledge of the thread
waiting on read(2) at the top level.

Both of these are consequences of vmpressure notification not being
pass-through.

This implements a pass-through behavior for vmpressure events.  When
writing to control.event_control, vmpressure event handlers may
optionally specify a mode.  There are two new modes:

 - "hierarchy": always propagate memory pressure events up the hierarchy
   regardless if descendant memcgs have their own notifiers registered,
   and

 - "local": only receive notifications when the memcg for which the
   event is registered experiences memory pressure.

Of course, processes may register for one notification of "low,local",
for example, and another for "low".

If no mode is specified, the current behavior is maintained for
backwards compatibility.

See the change to Documentation/cgroup-v1/memory.txt for full
specification.

[dan.carpenter@oracle.com: free the same pointer we allocated]
  Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Anton Vorontsov <anton@enomsg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Naoya Horiguchi
0348d2ebec mm: hwpoison: introduce idenfity_page_state
Factoring duplicate code into a function.

Link: http://lkml.kernel.org/r/1496305019-5493-10-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Naoya Horiguchi
ddd40d8a2c mm: hugetlb: delete dequeue_hwpoisoned_huge_page()
dequeue_hwpoisoned_huge_page() is no longer used, so let's remove it.

Link: http://lkml.kernel.org/r/1496305019-5493-9-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Naoya Horiguchi
78bb920344 mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error
Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
keep the error hugepage away from the system, which is OK but not good
enough because the hugepage still has a refcount and unpoison doesn't
work on the error hugepage (PageHWPoison flags are cleared but pages are
still leaked.) And there's "wasting health subpages" issue too.  This
patch reworks on me_huge_page() to solve these issues.

For hugetlb file, recently we have truncating code so let's use it in
hugetlbfs specific ->error_remove_page().

For anonymous hugepage, it's helpful to dissolve the error page after
freeing it into free hugepage list.  Migration entry and PageHWPoison in
the head page prevent the access to it.

TODO: dissolve_free_huge_page() can fail but we don't considered it yet.
It's not critical (and at least no worse that now) because in such case
the error hugepage just stays in free hugepage list without being
dissolved.  By virtue of PageHWPoison in head page, it's never allocated
to processes.

[akpm@linux-foundation.org: fix unused var warnings]
Fixes: 23a003bfd2 ("mm/madvise: pass return code of memory_failure() to userspace")
Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Naoya Horiguchi
761ad8d7c7 mm: hwpoison: introduce memory_failure_hugetlb()
memory_failure() is a big function and hard to maintain.  Handling
hugetlb- and non-hugetlb- case in a single function is not good, so this
patch separates PageHuge() branch into a new function, which saves many
PageHuge() check.

Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Naoya Horiguchi
d4a3a60b37 mm: soft-offline: dissolve free hugepage if soft-offlined
Now we have code to rescue most of healthy pages from a hwpoisoned
hugepage.  So let's apply it to soft_offline_free_page too.

Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Anshuman Khandual
c3114a84f7 mm: hugetlb: soft-offline: dissolve source hugepage after successful migration
Currently hugepage migrated by soft-offline (i.e.  due to correctable
memory errors) is contained as a hugepage, which means many non-error
pages in it are unreusable, i.e.  wasted.

This patch solves this issue by dissolving source hugepages into buddy.
As done in previous patch, PageHWPoison is set only on a head page of
the error hugepage.  Then in dissoliving we move the PageHWPoison flag
to the raw error page so that all healthy subpages return back to buddy.

[arnd@arndb.de: fix warnings: replace some macros with inline functions]
  Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Naoya Horiguchi
b37ff71cc6 mm: hwpoison: change PageHWPoison behavior on hugetlb pages
We'd like to narrow down the error region in memory error on hugetlb
pages.  However, currently we set PageHWPoison flags on all subpages in
the error hugepage and add # of subpages to num_hwpoison_pages, which
doesn't fit our purpose.

So this patch changes the behavior and we only set PageHWPoison on the
head page then increase num_hwpoison_pages only by 1.  This is a
preparation for narrow-down part which comes in later patches.

Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Naoya Horiguchi
09612fa653 mm: hugetlb: return immediately for hugetlb page in __delete_from_page_cache()
We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb page
now, but it's not enough because later code doesn't handle hugetlb
properly.  Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the
end of this function fires for hugetlb, which makes no sense.  So we
should return immediately for hugetlb pages.

Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Naoya Horiguchi
243abd5b78 mm: hugetlb: prevent reuse of hwpoisoned free hugepages
Patch series "mm: hwpoison: fixlet for hugetlb migration".

This patchset updates the hwpoison/hugetlb code to address 2 reported
issues.

One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see
http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) First
half was already fixed in mainline, and another half about hugetlb cases
are solved in this series.

Another issue is "narrow-down error affected region into a single 4kB
page instead of a whole hugetlb page" issue, which was tried by Anshuman
(http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
and I updated it to apply it more widely.

This patch (of 9):

We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
as we did before.  So current dequeue_huge_page_node() doesn't work as
intended because it still uses is_migrate_isolate_page() for this check.
This patch fixes it with PageHWPoison flag.

Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Eric Biggers
241f01fbed fs/buffer.c: make bh_lru_install() more efficient
To install a buffer_head into the cpu's LRU queue, bh_lru_install()
would construct a new copy of the queue and then memcpy it over the real
queue.  But it's easily possible to do the update in-place, which is
faster and simpler.  Some work can also be skipped if the buffer_head
was already in the queue.

As a microbenchmark I timed how long it takes to run sb_getblk()
10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
Effectively, this benchmarks looking up buffer_heads that are in the
page cache but not in the LRU:

	Before this patch: 1.758s
	After this patch: 1.653s

This patch also removes about 350 bytes of compiled code (on x86_64),
partly due to removal of the memcpy() which was being inlined+unrolled.

Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Nick Desaulniers
3457f41476 mm/zsmalloc.c: fix -Wunneeded-internal-declaration warning
is_first_page() is only called from the macro VM_BUG_ON_PAGE() which is
only compiled in as a runtime check when CONFIG_DEBUG_VM is set,
otherwise is checked at compile time and not actually compiled in.

Fixes the following warning, found with Clang:

  mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and will not be emitted [-Wunneeded-internal-declaration]
  static int is_first_page(struct page *page)
           ^

Link: http://lkml.kernel.org/r/20170524053859.29059-1-nick.desaulniers@gmail.com
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Gustavo A. R. Silva
dbac61a3f2 mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference
The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
might be NULL.

rollback_node_hotadd() dereferences this pointer.  Add NULL check to
avoid a potential NULL pointer dereference.

Addresses-Coverity-ID: 1369133
Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
David Rientjes
0622622677 mm, vmscan: avoid thrashing anon lru when free + file is low
The purpose of the code that commit 623762517e ("revert 'mm: vmscan:
do not swap anon pages just because free+file is low'") reintroduces is
to prefer swapping anonymous memory rather than trashing the file lru.

If the anonymous inactive lru for the set of eligible zones is
considered low, however, or the length of the list for the given reclaim
priority does not allow for effective anonymous-only reclaiming, then
avoid forcing SCAN_ANON.  Forcing SCAN_ANON will end up thrashing the
small list and leave unreclaimed memory on the file lrus.

If the inactive list is insufficient, fallback to balanced reclaim so
the file lru doesn't remain untouched.

[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Yevgen Pronenko
0a1345f8fe mm/memory.c: convert to DEFINE_DEBUGFS_ATTRIBUTE
The preferred strategy to define debugfs attributes is to use the
DEFINE_DEBUGFS_ATTRIBUTE() macro and to use debugfs_create_file_unsafe().

Link: http://lkml.kernel.org/r/20170528145948.32127-1-y.pronenko@gmail.com
Signed-off-by: Yevgen Pronenko <y.pronenko@gmail.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Vlastimil Babka
7a8f58f391 mm, page_alloc: fallback to smallest page when not stealing whole pageblock
Since commit 3bc48f96cf ("mm, page_alloc: split smallest stolen page
in fallback") we pick the smallest (but sufficient) page of all that
have been stolen from a pageblock of different migratetype.  However,
there are cases when we decide not to steal the whole pageblock.

Practically in the current implementation it means that we are trying to
fallback for a MIGRATE_MOVABLE allocation of order X, go through the
freelists from MAX_ORDER-1 down to X, and find free page of order Y.  If
Y is less than pageblock_order / 2, we decide not to steal all pages
from the pageblock.  When Y > X, it means we are potentially splitting a
larger page than we need, as there might be other pages of order Z,
where X <= Z < Y.  Since Y is already too small to steal whole
pageblock, picking smallest available Z will result in the same decision
and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or
MIGRATE_RECLAIMABLE pageblock.

This patch therefore changes the fallback algorithm so that in the
situation described above, we switch the fallback search strategy to go
from order X upwards to find the smallest suitable fallback.  In theory
there shouldn't be a downside of this change wrt fragmentation.

This has been tested with mmtests' stress-highalloc performing
GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint
statistics:

                                                        4.12.0-rc2      4.12.0-rc2
                                                         1-kernel4       2-kernel4
  Page alloc extfrag event                                  25640976    69680977
  Extfrag fragmenting                                       25621086    69661364
  Extfrag fragmenting for unmovable                            74409       73204
  Extfrag fragmenting unmovable placed with movable            69003       67684
  Extfrag fragmenting unmovable placed with reclaim.            5406        5520
  Extfrag fragmenting for reclaimable                           6398        8467
  Extfrag fragmenting reclaimable placed with movable            869         884
  Extfrag fragmenting reclaimable placed with unmov.            5529        7583
  Extfrag fragmenting for movable                           25540279    69579693

Since we force movable allocations to steal the smallest available page
(which we then practially always split), we steal less per fallback, so
the number of fallbacks increases and steals potentially happen from
different pageblocks.  This is however not an issue for movable pages
that can be compacted.

Importantly, the "unmovable placed with movable" statistics is lower,
which is the result of less fragmentation in the unmovable pageblocks.
The effect on reclaimable allocation is a bit unclear.

Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Shaohua Li
23955622ff swap: add block io poll in swapin path
For fast flash disk, async IO could introduce overhead because of
context switch.  block-mq now supports IO poll, which improves
performance and latency a lot.  swapin is a good place to use this
technique, because the task is waiting for the swapin page to continue
execution.

In my virtual machine, directly read 4k data from a NVMe with iopoll is
about 60% better than that without poll.  With iopoll support in swapin
patch, my microbenchmark (a task does random memory write) is about
10%~25% faster.  CPU utilization increases a lot though, 2x and even 3x
CPU utilization.  This will depend on disk speed.

While iopoll in swapin isn't intended for all usage cases, it's a win
for latency sensistive workloads with high speed swap disk.  block layer
has knob to control poll in runtime.  If poll isn't enabled in block
layer, there should be no noticeable change in swapin.

I got a chance to run the same test in a NVMe with DRAM as the media.
In simple fio IO test, blkpoll boosts 50% performance in single thread
test and ~20% in 8 threads test.  So this is the base line.  In above
swap test, blkpoll boosts ~27% performance in single thread test.
blkpoll uses 2x CPU time though.

If we enable hybid polling, the performance gain has very slight drop
but CPU time is only 50% worse than that without blkpoll.  Also we can
adjust parameter of hybid poll, with it, the CPU time penality is
reduced further.  In 8 threads test, blkpoll doesn't help though.  The
performance is similar to that without blkpoll, but cpu utilization is
similar too.  There is lock contention in swap path.  The cpu time
spending on blkpoll isn't high.  So overall, blkpoll swapin isn't worse
than that without it.

The swapin readahead might read several pages in in the same time and
form a big IO request.  Since the IO will take longer time, it doesn't
make sense to do poll, so the patch only does iopoll for single page
swapin.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Benson Leung
3c778a7fcf platform/chrome : Add myself as Maintainer
I'll be taking over maintainership of platform/chrome from Olof,
so let's add me to the list of maintainers.

Signed-off-by: Benson Leung <bleung@chromium.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Olof Johansson <olof@lixom.net>
2017-07-10 16:15:45 -07:00
Linus Torvalds
548aa0e3c5 Merge tag 'devprop-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull device properties framework updates from Rafael Wysocki:
 "These mostly rearrange the device properties core code and add a few
  helper functions to it as a foundation for future work.

  Specifics:

   - Rearrange the core device properties code by moving the code
     specific to each supported platform configuration framework (ACPI,
     DT and build-in) into a separate file (Sakari Ailus).

   - Add helper functions for accessing device properties in a
     firmware-agnostic way (Sakari Ailus, Kieran Bingham)"

* tag 'devprop-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  device property: Add fwnode_graph_get_port_parent
  device property: Add FW type agnostic fwnode_graph_get_remote_node
  device property: Introduce fwnode_device_is_available()
  device property: Move fwnode graph ops to firmware specific locations
  device property: Move FW type specific functionality to FW specific files
  ACPI: Constify argument to acpi_device_is_present()
2017-07-10 15:23:45 -07:00
Linus Torvalds
3226186843 Merge tag 'acpi-extra-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI updates from Rafael Wysocki:
 "These fix the ACPI SPCR table handling and add a workaround for APM
  X-Gene 8250 UART on top of that, fix two ACPI hotplug issues related
  to hot-remove failures, add a missing "static" to one function and
  constify some attribute_group structures.

  Specifics:

   - Fix the ACPI code handling the SPCR table to check access width of
     MMIO regions and add a workaround for APM X-Gene 8250 UART to use
     32-bit MMIO accesses with its register (Loc Ho).

   - Fix two ACPI-based hotplug issues related to the handling of
     hot-remove failures on the OS side (Chun-Yi Lee).

   - Constify attribute_group structures in a few places (Arvind Yadav).

   - Make one local function static (Colin Ian King)"

* tag 'acpi-extra-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / DPTF: constify attribute_group structures
  ACPI / LPSS: constify attribute_group structures
  ACPI: BGRT: constify attribute_group structures
  ACPI / power: constify attribute_group structures
  ACPI / scan: Indicate to platform when hot remove returns busy
  ACPI / bus: handle ACPI hotplug schedule errors completely
  ACPI / osi: Make local function acpi_osi_dmi_linux() static
  ACPI: SPCR: Workaround for APM X-Gene 8250 UART 32-alignment errata
  ACPI: SPCR: Use access width to determine mmio usage
2017-07-10 15:19:40 -07:00
Linus Torvalds
1633b39610 Merge tag 'pm-extra-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
 "These revert one recent change in the generic power domains
  framework, fix a recently introduced build issue in there and
  constify attribute_group structures in some places.

  Specifics:

   - Revert a recent change in the generic power domains (genpd)
     framework that led to regressions and turned out the be misguided
     (Rafael Wysocki).

   - Fix a recently introduced build issue in the generic power domains
     (genpd) framework (Arnd Bergmann).

   - Constify attribute_group structures in the PM core, the cpufreq
     stats code and in intel_pstate (Arvind Yadav)"

* tag 'pm-extra-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: constify attribute_group structures
  cpufreq: cpufreq_stats: constify attribute_group structures
  PM / sleep: constify attribute_group structures
  PM / Domains: provide pm_genpd_poweroff_noirq() stub
  Revert "PM / Domains: Handle safely genpd_syscore_switch() call on non-genpd device"
2017-07-10 15:16:21 -07:00
Linus Torvalds
5cdd4c0468 Merge tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added new features such as disk quota and statx,
  and modified internal bio management flow to merge more IOs depending
  on block types. We've also made internal threads freezeable for
  Android battery life. In addition to them, there are some patches to
  avoid lock contention as well as a couple of deadlock conditions.

  Enhancements:
   - support usrquota, grpquota, and statx
   - manage DATA/NODE typed bios separately to serialize more IOs
   - modify f2fs_lock_op/wio_mutex to avoid lock contention
   - prevent lock contention in migratepage

  Bug fixes:
   - fix missing load of written inode flag
   - fix worst case victim selection in GC
   - freezeable GC and discard threads for Android battery life
   - sanitize f2fs metadata to deal with security hole
   - clean up sysfs-related code and docs"

* tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (59 commits)
  f2fs: support plain user/group quota
  f2fs: avoid deadlock caused by lock order of page and lock_op
  f2fs: use spin_{,un}lock_irq{save,restore}
  f2fs: relax migratepage for atomic written page
  f2fs: don't count inode block in in-memory inode.i_blocks
  Revert "f2fs: fix to clean previous mount option when remount_fs"
  f2fs: do not set LOST_PINO for renamed dir
  f2fs: do not set LOST_PINO for newly created dir
  f2fs: skip ->writepages for {mete,node}_inode during recovery
  f2fs: introduce __check_sit_bitmap
  f2fs: stop gc/discard thread in prior during umount
  f2fs: introduce reserved_blocks in sysfs
  f2fs: avoid redundant f2fs_flush after remount
  f2fs: report # of free inodes more precisely
  f2fs: add ioctl to do gc with target block address
  f2fs: don't need to check encrypted inode for partial truncation
  f2fs: measure inode.i_blocks as generic filesystem
  f2fs: set CP_TRIMMED_FLAG correctly
  f2fs: require key for truncate(2) of encrypted file
  f2fs: move sysfs code from super.c to fs/f2fs/sysfs.c
  ...
2017-07-10 14:29:45 -07:00
Richard Weinberger
61e8d46245 um: Correctly check for PTRACE_GETRESET/SETREGSET
When checking for PTRACE_GETRESET/SETREGSET, make sure that
the correct header file is included. We need linux/ptrace.h
which contains all ptrace UAPI related defines.
Otherwise #if defined(PTRACE_GETRESET) is always false.

Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-10 22:58:06 +02:00
Thomas Meyer
94df7fe0d3 um: v2: Use generic NOTES macro
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-10 22:58:02 +02:00
Rafael J. Wysocki
f19e80b394 Merge branches 'acpi-spcr', 'acpi-osi', 'acpi-bus', 'acpi-scan' and 'acpi-misc'
* acpi-spcr:
  ACPI: SPCR: Workaround for APM X-Gene 8250 UART 32-alignment errata
  ACPI: SPCR: Use access width to determine mmio usage

* acpi-osi:
  ACPI / osi: Make local function acpi_osi_dmi_linux() static

* acpi-bus:
  ACPI / bus: handle ACPI hotplug schedule errors completely

* acpi-scan:
  ACPI / scan: Indicate to platform when hot remove returns busy

* acpi-misc:
  ACPI / DPTF: constify attribute_group structures
  ACPI / LPSS: constify attribute_group structures
  ACPI: BGRT: constify attribute_group structures
  ACPI / power: constify attribute_group structures
2017-07-10 22:46:21 +02:00
Rafael J. Wysocki
15d56b3921 Merge branches 'pm-domains', 'pm-sleep' and 'pm-cpufreq'
* pm-domains:
  PM / Domains: provide pm_genpd_poweroff_noirq() stub
  Revert "PM / Domains: Handle safely genpd_syscore_switch() call on non-genpd device"

* pm-sleep:
  PM / sleep: constify attribute_group structures

* pm-cpufreq:
  cpufreq: intel_pstate: constify attribute_group structures
  cpufreq: cpufreq_stats: constify attribute_group structures
2017-07-10 22:45:16 +02:00
Jin Yao
80f62589fa perf annotate: Fix broken arrow at row 0 connecting jmp instruction to its target
When the jump instruction is displayed at the row 0 in annotate view,
the arrow is broken. An example:

 16.86 │   ┌──je     82
  0.01 │      movsd  (%rsp),%xmm0
       │      movsd  0x8(%rsp),%xmm4
       │      movsd  0x8(%rsp),%xmm1
       │      movsd  (%rsp),%xmm3
       │      divsd  %xmm4,%xmm0
       │      divsd  %xmm3,%xmm1
       │      movsd  (%rsp),%xmm2
       │      addsd  %xmm1,%xmm0
       │      addsd  %xmm2,%xmm0
       │      movsd  %xmm0,(%rsp)
       │82:   sub    $0x1,%ebx
 83.03 │    ↑ jne    38
       │      add    $0x10,%rsp
       │      xor    %eax,%eax
       │      pop    %rbx
       │    ← retq

The patch increments the row number before checking with 0.

Signed-off-by: Yao Jin <yao.jin@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Fixes: 944e1abed9 ("perf ui browser: Add method to draw up/down arrow line")
Link: http://lkml.kernel.org/r/1496901704-30275-1-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-07-10 16:36:40 -03:00
Arnaldo Carvalho de Melo
ede5626d30 perf evsel: State in the default event name if attr.exclude_kernel is set
When no event is specified perf will use the "cycles" hardware event
with the highest precision available in the processor, and excluding
kernel events for non-root users, so make that clear in the event name
by setting the "u" event modifier, i.e. "cycles:upp".

E.g.:

The default for root:

  # perf record usleep 1
  # perf evlist -v
  cycles:ppp: ..., precise_ip: 3, exclude_kernel: 0, ...
  #

And for !root:

  $ perf record usleep 1
  $ perf evlist -v
  cycles:uppp: ... , precise_ip: 3, exclude_kernel: 1, ...
  $

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-lf29zcdl422i9knrgde0uwy3@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-07-10 16:19:25 -03:00
Arnaldo Carvalho de Melo
d37a369790 perf evsel: Fix attr.exclude_kernel setting for default cycles:p
To allow probing the max attr.precise_ip setting for non-root users
we unconditionally set attr.exclude_kernel, which makes the detection
work but should be done only for !root, fix it.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: 97365e8136 ("perf evsel: Set attr.exclude_kernel when probing max attr.precise_ip")
Link: http://lkml.kernel.org/n/tip-bl6bbxzxloonzvm4nvt7oqgj@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-07-10 16:14:48 -03:00
Shaohua Li
b222dd2fdd block: call bio_uninit in bio_endio
bio_free isn't a good place to free cgroup info. There are a
lot of cases bio is allocated in special way (for example, in stack) and
never gets called by bio_put hence bio_free, we are leaking memory. This
patch moves the free to bio endio, which should be called anyway. The
bio_uninit call in bio_free is kept, in case the bio never gets called
bio endio.

This assumes ->bi_end_io() doesn't access cgroup info, which seems true
in my audit.

This along with Christoph's integrity patch should fix the memory leak
issue.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-10 12:43:33 -06:00
Linus Torvalds
7cee9384cb Fix up over-eager 'wait_queue_t' renaming
Commit ac6424b981 ("sched/wait: Rename wait_queue_t =>
wait_queue_entry_t") had scripted the renaming incorrectly, and didn't
actually check that the 'wait_queue_t' was a full token.

As a result, it also triggered on 'wait_queue_token', and renamed that
to 'wait_queue_entry_token' entry in the autofs4 packet structure
definition too.  That was entirely incorrect, and not intended.

The end result built fine when building just the kernel - because
everything had been renamed consistently there - but caused problems in
user space because the "struct autofs_packet_missing" type is exported
as part of the uapi.

This scripts it all back again:

    git grep -lw wait_queue_entry_token |
        xargs sed -i 's/wait_queue_entry_token/wait_queue_token/g'

and checks the end result.

Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Fixes: ac6424b981 ("sched/wait: Rename wait_queue_t => wait_queue_entry_t")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 11:40:19 -07:00
Arnd Bergmann
de92cd6cf4 net/mlx5: IPSec, fix 64-bit division correctly
The new IPSec offload code introduced a build error:

drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.o: In function `mlx5e_ipsec_build_inverse_table':
ipsec_rxtx.c:(.text+0x556): undefined reference

Another patch was added on top to fix the build error, but
that introduced a new bug, as we now use the remainder of
the division rather than the result.

This makes it use the correct helper function instead.

Fixes: 5dfd87b67c ("net/mlx5: IPSec, Fix 64-bit division on 32-bit builds")
Fixes: 2ac9cfe782 ("net/mlx5e: IPSec, Add Innova IPSec offload TX data path")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-10 19:34:00 +01:00
Gustavo A. R. Silva
6f6e0b217a drm/rockchip: fix NULL check on devm_kzalloc() return value
The right variable to check here is port, not dp.

This issue was detected using Coccinelle and the following semantic patch:

@@
expression x;
identifier fld;
@@

* x = devm_kzalloc(...);
  ... when != x == NULL
  x->fld

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Acked-by: Mark Yao <mark.yao@rock-chips.com>
Signed-off-by: Sean Paul <seanpaul@chromium.org>
Link: http://patchwork.freedesktop.org/patch/msgid/20170706215833.GA25411@embeddedgus
2017-07-10 14:13:00 -04:00
Linus Torvalds
9eb7888005 Merge tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi
Pull IPMI updates from Corey Minyard:
 "Some small fixes for IPMI, and one medium sized changed.

  The medium sized change is adding a platform device for IPMI entries
  in the DMI table. Otherwise there is no auto loading for IPMI devices
  if they are only in the DMI table"

* tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi:
  ipmi:ssif: Add missing unlock in error branch
  char: ipmi: constify bmc_dev_attr_group and bmc_device_type
  ipmi:ssif: Check dev before setting drvdata
  ipmi: Convert DMI handling over to a platform device
  ipmi: Create a platform device for a DMI-specified IPMI interface
  ipmi: use rcu lock around call to intf->handlers->sender()
  ipmi:ssif: Use i2c_adapter_id instead of adapter->nr
  ipmi: Use the proper default value for register size in ACPI
  ipmi_ssif: remove redundant null check on array client->adapter->name
  ipmi/watchdog: fix watchdog timeout set on reboot
  ipmi_ssif: unlock on allocation failure
2017-07-10 10:59:29 -07:00
Linus Torvalds
642338ba33 Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Darrick Wong:
 "Here are some changes for you for 4.13. For the most part it's fixes
  for bugs and deadlock problems, and preparation for online fsck in
  some future merge window.

   - Avoid quotacheck deadlocks

   - Fix transaction overflows when bunmapping fragmented files

   - Refactor directory readahead

   - Allow admin to configure if ASSERT is fatal

   - Improve transaction usage detail logging during overflows

   - Minor cleanups

   - Don't leak log items when the log shuts down

   - Remove double-underscore typedefs

   - Various preparation for online scrubbing

   - Introduce new error injection configuration sysfs knobs

   - Refactor dq_get_next to use extent map directly

   - Fix problems with iterating the page cache for unwritten data

   - Implement SEEK_{HOLE,DATA} via iomap

   - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

   - Don't use MAXPATHLEN to check on-disk symlink target lengths"

* tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
  xfs: don't crash on unexpected holes in dir/attr btrees
  xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
  xfs: fix contiguous dquot chunk iteration livelock
  xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
  vfs: Add iomap_seek_hole and iomap_seek_data helpers
  vfs: Add page_cache_seek_hole_data helper
  xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
  xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
  xfs: Check for m_errortag initialization in xfs_errortag_test
  xfs: grab dquots without taking the ilock
  xfs: fix semicolon.cocci warnings
  xfs: Don't clear SGID when inheriting ACLs
  xfs: free cowblocks and retry on buffered write ENOSPC
  xfs: replace log_badcrc_factor knob with error injection tag
  xfs: convert drop_writes to use the errortag mechanism
  xfs: remove unneeded parameter from XFS_TEST_ERROR
  xfs: expose errortag knobs via sysfs
  xfs: make errortag a per-mountpoint structure
  xfs: free uncommitted transactions during log recovery
  xfs: don't allow bmap on rt files
  ...
2017-07-10 10:51:53 -07:00
Jens Axboe
459bd0dc39 Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-linus
Pull followup NVMe (mostly) changes from Sagi:

I added the quiesce/unquiesce patches in here as it's
easy for me easily apply changes on top. It has accumulated
reviews and includes mostly nvme anyway, please tell me if
you don't want to take them with this.

This includes:
- quiesce/unquiesce fixes in nvme and others from me
- nvme-fc add create association padding spec updates from James
- some more quirking from MKP
- nvmet nit cleanup from Max
- Fix nvme-rdma racy RDMA completion signalling from Marta
- some centralization patches from me
- add tagset nr_hw_queues updates on controller resets in
  nvme drivers from me
- nvme-rdma fix resources recycling when doing error recovery from me
- minor cleanups in nvme-fc from me
2017-07-10 11:44:34 -06:00
Xiao Ni
b5d27718f3 Raid5 should update rdev->sectors after reshape
The raid5 md device is created by the disks which we don't use the total size. For example,
the size of the device is 5G and it just uses 3G of the devices to create one raid5 device.
Then change the chunksize and wait reshape to finish. After reshape finishing stop the raid
and assemble it again. It fails.
mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean
mdadm /dev/md0 --grow --chunk=64
wait reshape to finish
mdadm -S /dev/md0
mdadm -As
The error messages:
[197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing!
[197519.821686] md: md_import_device returned -22

After reshape the data offset is changed. It selects backwards direction in this condition.
In function super_1_load it compares the available space of the underlying device with
sb->data_size. The new data offset gets bigger after reshape. So super_1_load returns -EINVAL.
rdev->sectors is updated in md_finish_reshape. Then sb->data_size is set in super_1_sync based
on rdev->sectors. So add md_finish_reshape in end_reshape.

Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-10 10:43:58 -07:00
Guoqing Jiang
4aaf7694f8 md/bitmap: don't read page from device with Bitmap_sync
The device owns Bitmap_sync flag needs recovery
to become in sync, and read page from this type
device could get stale status.

Also add comments for Bitmap_sync bit per the
suggestion from Shaohua and Neil.

Previous disscussion can be found here:
https://marc.info/?t=149760428900004&r=1&w=2

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-07-10 10:30:41 -07:00
Linus Torvalds
6618a24ab2 Merge branch 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
 "This fixes a user-visible bug introduced by the nowait-aio patches
  merged in this cycle"

* 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: nowait aio: Correct assignment of pos
2017-07-10 10:27:48 -07:00
Linus Torvalds
1d07b6cb96 Merge branch 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull copy*_iter fix from Al Viro.

[ Al used entirely the wrong return value. Oopsie. ]

* 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix brown paperbag bug in inlined copy_..._iter()
2017-07-10 10:13:45 -07:00
Linus Torvalds
a91ab911df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID updates from Jiri Kosina:

 - open/close tracking improvements from Dmitry Torokhov

 - battery support improvements in Wacom driver from Jason Gerecke

 - Win8 support fixes from Benjamin Tissories and Hans de Geode

 - misc fixes to Intel-ISH driver from Arnd Bergmann

 - support for quite a few new devices and small assorted fixes here and
   there

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (35 commits)
  HID: intel-ish-hid: Enable Gemini Lake ish driver
  HID: intel-ish-hid: Enable Cannon Lake ish driver
  HID: wacom: fix mistake in printk
  HID: multitouch: optimize the sticky fingers timer
  HID: multitouch: fix rare Win 8 cases when the touch up event gets missing
  HID: multitouch: use BIT macro
  HID: Add driver for Retrode2 joypad adapter
  HID: multitouch: Add support for Google Rose Touchpad
  HID: multitouch: Support PTP Stick and Touchpad device
  HID: core: don't use negative operands when shift
  HID: apple: Use country code to detect ISO keyboards
  HID: remove no longer used hid->open field
  greybus: hid: remove custom locking from gb_hid_open/close
  HID: usbhid: remove custom locking from usbhid_open/close
  HID: i2c-hid: remove custom locking from i2c_hid_open/close
  HID: serialize hid_hw_open and hid_hw_close
  HID: usbhid: do not rely on hid->open when deciding to do IO
  HID: hiddev: use hid_hw_power instead of usbhid_get/put_power
  HID: hiddev: use hid_hw_open/close instead of usbhid_open/close
  HID: asus: Add support for Zen AiO MD-5110 keyboard
  ...
2017-07-10 09:22:48 -07:00
Max Gurtovoy
c2f30f08c1 nvmet: avoid unneeded assignment of submit_bio return value
We actually using the cookie returned from the last submit_bio
call.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-07-10 18:45:38 +03:00
Lorenzo Pieralisi
f01fc41773 ARM/PCI: Fix pcibios_init_resource() struct pci_host_bridge leak
Since commit 97ad2bdcbe ("ARM/PCI: Convert PCI scan API to
pci_scan_root_bus_bridge()") the space for struct pci_sys_data is allocated
by pci_alloc_host_bridge() as part of the struct pci_host_bridge.

Therefore, failure paths must deallocate the entire pci_host_bridge by
using pci_free_host_bridge().

Fixes: 97ad2bdcbe ("ARM/PCI: Convert PCI scan API to pci_scan_root_bus_bridge()")
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Andrew Lunn <andrew@lunn.ch>
2017-07-10 09:33:14 -05:00
Takashi Iwai
85dc0f8554 ALSA: pcm: Simplify check for dma_mmap_coherent() availability
We check the availability of dma_mmap_coherent() in hw_support_mmap()
but with an ugly ifdef of lots of arch-checks.  Now we have a nice
CONFIG_ARCH_NO_COHERENT_DMA_MMAP kconfig, and this can be used
together with CONFIG_HAS_DMA check for a cleaner and more
comprehensive check.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-07-10 16:05:58 +02:00
Geert Uytterhoeven
abe594c2cf ALSA: pcm: Protect call to dma_mmap_coherent() by check for HAS_DMA
If NO_DMA=y:

    sound/core/pcm_native.o: In function `snd_pcm_lib_default_mmap':
    pcm_native.c:(.text+0x144c): undefined reference to `bad_dma_ops'
    pcm_native.c:(.text+0x1474): undefined reference to `dma_common_mmap'

Add a check for HAS_DMA to fix this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-07-10 16:04:08 +02:00