migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.
If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
This program exercises the anon_vma_moveto_tail:
===
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");
return 0;
}
===
$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
You can now use it on all perf tools, such as:
perf record -e probe:anon_vma_moveto_tail -aR sleep 1
$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pawel Sikora <pluto@agmk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The maximum number of dirty pages that exist in the system at any time is
determined by a number of pages considered dirtyable and a user-configured
percentage of those, or an absolute number in bytes.
This number of dirtyable pages is the sum of memory provided by all the
zones in the system minus their lowmem reserves and high watermarks, so
that the system can retain a healthy number of free pages without having
to reclaim dirty pages.
But there is a flaw in that we have a zoned page allocator which does not
care about the global state but rather the state of individual memory
zones. And right now there is nothing that prevents one zone from filling
up with dirty pages while other zones are spared, which frequently leads
to situations where kswapd, in order to restore the watermark of free
pages, does indeed have to write pages from that zone's LRU list. This
can interfere so badly with IO from the flusher threads that major
filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
already, taking away the VM's only possibility to keep such a zone
balanced, aside from hoping the flushers will soon clean pages from that
zone.
Enter per-zone dirty limits. They are to a zone's dirtyable memory what
the global limit is to the global amount of dirtyable memory, and try to
make sure that no single zone receives more than its fair share of the
globally allowed dirty pages in the first place. As the number of pages
considered dirtyable excludes the zones' lowmem reserves and high
watermarks, the maximum number of dirty pages in a zone is such that the
zone can always be balanced without requiring page cleaning.
As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to pass
__GFP_WRITE when they know in advance that the page will be written to and
become dirty soon. The page allocator will then attempt to allocate from
the first zone of the zonelist - which on NUMA is determined by the task's
NUMA memory policy - that has not exceeded its dirty limit.
At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case. With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations. Workloads that
previously could fit their dirty pages completely in the higher zone may
be forced to allocate from lower zones, but the amount of pages that
"spill over" are limited themselves by the lower zones' dirty constraints,
and thus unlikely to become a problem.
For now, the problem of unfair dirty page distribution remains for NUMA
configurations where the zones allowed for allocation are in sum not big
enough to trigger the global dirty limits, wake up the flusher threads and
remedy the situation. Because of this, an allocation that could not
succeed on any of the considered zones is allowed to ignore the dirty
limits before going into direct reclaim or even failing the allocation,
until a future patch changes the global dirty throttling and flusher
thread activation so that they take individual zone states into account.
Test results
15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
seconds nr_vmscan_write
(stddev) min| median| max
xfs
vanilla: 549.747( 3.492) 0.000| 0.000| 0.000
patched: 550.996( 3.802) 0.000| 0.000| 0.000
fuse-ntfs
vanilla: 1183.094(53.178) 54349.000| 59341.000| 65163.000
patched: 558.049(17.914) 0.000| 0.000| 43.000
btrfs
vanilla: 573.679(14.015) 156657.000| 460178.000| 606926.000
patched: 563.365(11.368) 0.000| 0.000| 1362.000
ext4
vanilla: 561.197(15.782) 0.000|2725438.000|4143837.000
patched: 568.806(17.496) 0.000| 0.000| 0.000
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.
This patch:
The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.
The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:
+----------+ ---
| anon | |
+----------+ |
| | |
| | -- dirty limit new -- flusher new
| file | | |
| | | |
| | -- dirty limit old -- flusher old
| | |
+----------+ --- reclaim
| reserved |
+----------+
| kernel |
+----------+
This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages. The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free. We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.
Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.
But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.
[akpm@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling alloc_pages_exact_node() means the allocation only passes the
zonelist of a single node into the page allocator. If that node isn't
online, it's zonelist may never have been initialized causing a strange
oops that may not immediately be clear.
I recently debugged an issue where node 0 wasn't online and an allocator
was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
pointer on zonelist->_zoneref. If CONFIG_DEBUG_VM is enabled, though, it
would be nice to catch this a bit earlier.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
on access (read,write) to an unallocated page, which permits us to catch
code which corrupts memory. However the kernel is trying to maximise
memory usage, hence there are usually few free pages in the system and
buggy code usually corrupts some crucial data.
This patch changes the buddy allocator to keep more free/protected pages
and to interlace free/protected and allocated pages to increase the
probability of catching corruption.
When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
debug_guardpage_minorder defines the minimum order used by the page
allocator to grant a request. The requested size will be returned with
the remaining pages used as guard pages.
The default value of debug_guardpage_minorder is zero: no change from
current behaviour.
[akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can place this in definitions that we expect the compiler to remove by
dead code elimination. If this assertion fails, we get a nice error
message at build time.
The GCC function attribute error("message") was added in version 4.3, so
we define a new macro __linktime_error(message) to expand to this for
GCC-4.3 and later. This will give us an error diagnostic from the
compiler on the line that fails. For other compilers
__linktime_error(message) expands to nothing, and we have to be content
with a link time error, but at least we will still get a build error.
BUILD_BUG() expands to the undefined function __build_bug_failed() and
will fail at link time if the compiler ever emits code for it. On GCC-4.3
and later, attribute((error())) is used so that the failure will be noted
at compile time instead.
Signed-off-by: David Daney <david.daney@cavium.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: DM <dm.n9107@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Cross reported;
Under the following conditions, __alloc_pages_slowpath can loop forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER
These conditions happen very often during suspend and resume,
when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
allocations into __GFP_WAIT.
The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER.
In his fix, he avoided retrying the allocation if reclaim made no progress
and __GFP_FS was not set. The problem is that this would result in
GFP_NOIO allocations failing that previously succeeded which would be very
unfortunate.
The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
behave like GFP_NOIO is that normally flushers will be cleaning pages and
kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
The same does not necessarily apply during suspend as the storage device
may be suspended.
This patch special cases the suspend case to fail the page allocation if
reclaim cannot make progress and adds some documentation on how
gfp_allowed_mask is currently used. Failing allocations like this may
cause suspend to abort but that is better than a livelock.
[mgorman@suse.de: Rework fix to be suspend specific]
[rientjes@google.com: Move suspended device check to should_alloc_retry]
Reported-by: Colin Cross <ccross@android.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ext4 commits for 3.3 merge window
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
ext4: fix undefined behavior in ext4_fill_flex_info()
ext4: make more symbols static
ext4: make local symbol ext4_initxattrs static
jbd2: fix hung processes in jbd2_journal_lock_updates()
ext4: reserve new feature flag codepoints
ext4: Report max_batch_time option correctly
ext4: add missing ext4_resize_end on error paths
ext4: let ext4_group_add() use common code
ext4: let ext4_group_extend() use common code
ext4: add new online resize interface
ext4: add a new function which adds a flex group to a fs
ext4: add a new function which allocates bitmaps and inode tables
ext4: pass verify_reserved_gdb() the number of group decriptors
ext4: add a function which updates the super block during online resizing
ext4: add a function which sets up a block group descriptors of a flex bg
ext4: add a function which sets up group blocks of a flex bg
ext4: add a structure which will be used by 64bit-resize interface
ext4: add a function which adds a new group descriptors to a fs
ext4: add a function which extends a group without checking parameters
ext4: use proper little-endian bitops
...
* 'nfs-for-3.3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Change the default setting of the nfs4_disable_idmapping parameter
NFSv4: Save the owner/group name string when doing open
NFS: Remove pNFS bloat from the generic write path
pnfs-obj: Must return layout on IO error
pnfs-obj: pNFS errors are communicated on iodata->pnfs_error
NFS: Cache state owners after files are closed
NFS: Clean up nfs4_find_state_owners_locked()
NFSv4: include bitmap in nfsv4 get acl data
nfs: fix a minor do_div portability issue
NFSv4.1: cleanup comment and debug printk
NFSv4.1: change nfs4_free_slot parameters for dynamic slots
NFSv4.1: cleanup init and reset of session slot tables
NFSv4.1: fix backchannel slotid off-by-one bug
nfs: fix regression in handling of context= option in NFSv4
NFS - fix recent breakage to NFS error handling.
NFS: Retry mounting NFSROOT
SUNRPC: Clean up the RPCSEC_GSS service ticket requests
MTD pull for 3.3
* tag 'for-linus-3.3' of git://git.infradead.org/mtd-2.6: (113 commits)
mtd: Fix dependency for MTD_DOC200x
mtd: do not use mtd->block_markbad directly
logfs: do not use 'mtd->block_isbad' directly
mtd: introduce mtd_can_have_bb helper
mtd: do not use mtd->suspend and mtd->resume directly
mtd: do not use mtd->lock, unlock and is_locked directly
mtd: do not use mtd->sync directly
mtd: harmonize mtd_writev usage
mtd: do not use mtd->lock_user_prot_reg directly
mtd: mtd->write_user_prot_reg directly
mtd: do not use mtd->read_*_prot_reg directly
mtd: do not use mtd->get_*_prot_info directly
mtd: do not use mtd->read_oob directly
mtd: mtdoops: do not use mtd->panic_write directly
romfs: do not use mtd->get_unmapped_area directly
mtd: do not use mtd->get_unmapped_area directly
mtd: do use mtd->point directly
mtd: introduce mtd_has_oob helper
mtd: mtdcore: export symbols cleanup
mtd: clean-up the default_mtd_writev function
...
Fix up trivial edit/remove conflict in drivers/staging/spectra/lld_mtd.c
* 'drm-core-next' of git://people.freedesktop.org/~airlied/linux: (307 commits)
drm/nouveau/pm: fix build with HWMON off
gma500: silence gcc warnings in mid_get_vbt_data()
drm/ttm: fix condition (and vs or)
drm/radeon: double lock typo in radeon_vm_bo_rmv()
drm/radeon: use after free in radeon_vm_bo_add()
drm/sis|via: don't return stack garbage from free_mem ioctl
drm/radeon/kms: remove pointless CS flags priority struct
drm/radeon/kms: check if vm is supported in VA ioctl
drm: introduce drm_can_sleep and use in intel/radeon drivers. (v2)
radeon: Fix disabling PCI bus mastering on big endian hosts.
ttm: fix agp since ttm tt rework
agp: Fix multi-line warning message whitespace
drm/ttm/dma: Fix accounting error when calling ttm_mem_global_free_page and don't try to free freed pages.
drm/ttm/dma: Only call set_pages_array_wb when the page is not in WB pool.
drm/radeon/kms: sync across multiple rings when doing bo moves v3
drm/radeon/kms: Add support for multi-ring sync in CS ioctl (v2)
drm/radeon: GPU virtual memory support v22
drm: make DRM_UNLOCKED ioctls with their own mutex
drm: no need to hold global mutex for static data
drm/radeon/benchmark: common modes sweep ignores 640x480@32
...
Fix up trivial conflicts in radeon/evergreen.c and vmwgfx/vmwgfx_kms.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (64 commits)
Input: tc3589x-keypad - add missing kerneldoc
Input: ucb1400-ts - switch to using dev_xxx() for diagnostic messages
Input: ucb1400_ts - convert to threaded IRQ
Input: ucb1400_ts - drop inline annotations
Input: usb1400_ts - add __devinit/__devexit section annotations
Input: ucb1400_ts - set driver owner
Input: ucb1400_ts - convert to use dev_pm_ops
Input: psmouse - make sure we do not use stale methods
Input: evdev - do not block waiting for an event if fd is nonblock
Input: evdev - if no events and non-block, return EAGAIN not 0
Input: evdev - only allow reading events if a full packet is present
Input: add driver for pixcir i2c touchscreens
Input: samsung-keypad - implement runtime power management support
Input: tegra-kbc - report wakeup key for some platforms
Input: tegra-kbc - add device tree bindings
Input: add driver for AUO In-Cell touchscreens using pixcir ICs
Input: mpu3050 - configure the sampling method
Input: mpu3050 - ensure we enable interrupts
Input: mpu3050 - add of_match table for device-tree probing
Input: sentelic - document the latest hardware
...
Fix up fairly trivial conflicts (device tree matching conflicting with
some independent cleanups) in drivers/input/keyboard/samsung-keypad.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (68 commits)
hid-input/battery: add FEATURE quirk
hid-input/battery: remove battery_val
hid-input/battery: power-supply type really *is* a battery
hid-input/battery: make the battery setup common for INPUTs and FEATUREs
hid-input/battery: deal with both FEATURE and INPUT report batteries
hid-input/battery: add quirks for battery
hid-input/battery: remove apparently redundant kmalloc
hid-input: add support for HID devices reporting Battery Strength
HID: hid-multitouch: add support 9 new Xiroku devices
HID: multitouch: add support for 3M 32"
HID: multitouch: add support of Atmel multitouch panels
HID: usbhid: defer LED setting to a workqueue
HID: usbhid: hid-core: submit queued urbs before suspend
HID: usbhid: remove LED_ON
HID: emsff: use symbolic name instead of hardcoded PID constant
HID: Enable HID_QUIRK_MULTI_INPUT for Trio Linker Plus II
HID: Kconfig: fix syntax
HID: introduce proper dependency of HID_BATTERY on POWER_SUPPLY
HID: multitouch: support PixArt optical touch screen
HID: make parser more verbose about parsing errors by default
...
Fix up rename/delete conflict in drivers/hid/hid-hyperv.c (removed in
staging, moved in this branch) and similarly for the rules for same file
in drivers/staging/hv/{Kconfig,Makefile}.
* git://www.linux-watchdog.org/linux-watchdog:
watchdog: omap_wdt.c: fix the WDIOC_GETBOOTSTATUS ioctl if not implemented.
watchdog: new driver for VIA chipsets
watchdog: ath79_wdt: flush register writes
drivers/watchdog/lantiq_wdt.c: drop iounmap for devm_ allocated data
watchdog: documentation: describe nowayout in coversion-guide
watchdog: documentation: update index file
watchdog: Convert wm831x driver to devm_kzalloc()
watchdog: add nowayout helpers to Watchdog Timer Driver Kernel API
watchdog: convert drivers/watchdog/* to use module_platform_driver()
watchdog: Use DEFINE_SPINLOCK() for static spinlocks
watchdog: Convert Wolfson drivers to module_platform_driver
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (40 commits)
regulator: set constraints.apply_uV to 0 in of_get_fixed_voltage_config
regulator: max8925: fix enabled/disabled judgement mistake
regulator: add regulator_bulk_force_disable function
regulator: pass regulator_register of_node in fixed voltage driver
regulator: add regulator_force_disable() definition for !CONFIG_REGULATOR
regulator: Enable supply regulator if child rail is enabled.
regulator: mc13892: Convert to devm_kzalloc()
regulator: mc13783: Convert to devm_kzalloc()
regulator: Fix checking return value of create_regulator
regulator: Fix the error handling if create_regulator fails
regulator: Export regulator_is_supported_voltage()
regulator: mc13892: add device tree probe support
regulator: mc13892: remove the unnecessary prefix from regulator name
regulator: Convert wm831x regulator drivers to devm_kzalloc()
regulator: da9052: Staticize non-exported symbols
regulator: Replace kzalloc with devm_kzalloc and if-else with a switch-case for da9052-regulator
regulator: Update da9052-regulator for DT changes
regulator: DA9052/53 Regulator support
regulator: pass device_node to of_get_regulator_init_data()
regulator: If a single voltage is set with device tree then set apply_uV
...
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (31 commits)
pinctrl: remove unnecessary max pin number
pinctrl: correct a offset while enumerating pins
pinctrl: some typo fixes
pinctrl: rename U300 and SIRF pin controllers
pinctrl: pass name instead of device to pin_config_*
pinctrl: add "struct seq_file;" to pinconf.h
pinctrl: conjure names for unnamed pins
pinctrl: add a group-specific hog macro
pinctrl: don't create a device for each pin controller
arm/u300: don't use PINMUX_MAP_PRIMARY*
pinctrl: implement PINMUX_MAP_SYS_HOG
pinctrl: add a pin config interface
pinctrl/coh901: driver to request its pins
pinctrl: u300-pinmux: register proper GPIO ranges
pinctrl: move the U300 GPIO driver to pinctrl
ARM: u300: localize GPIO assignments
pinctrl: make it possible to add multiple maps
pinctrl: make a copy of pinmux map
pinctrl: GPIO direction support for muxing
pinctrl: print pin range in GPIO range debugs
...
* 'upstream-linus' of git://github.com/jgarzik/libata-dev:
ahci: support the STA2X11 I/O Hub
pata_bf54x: fix BMIDE status register emulation
ata: add ata port hibernate callbacks
ata: update ata port's runtime status during system resume
[SCSI] runtime resume parent for child's system-resume
ahci: platform support for suspend/resume
libata-core: kill duplicate statement in ata_do_set_mode()
pata_of_platform: remove direct dependency on OF_IRQ
SATA/PATA: convert drivers/ata/* to use module_platform_driver()
pata_cs5536: forward port changes from cs5536
libata-sff: use ATAPI_{COD|IO}
ata: add ata port runtime PM callbacks
ata: add ata port system PM callbacks
[SCSI] sd: check runtime PM status in sd_shutdown
[SCSI] check runtime PM status in system PM
[SCSI] add flag to skip the runtime PM calls on the host
ata: make ata port as parent device of scsi host
ahci: start engine only during soft/hard resets
Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
cause shrink_dcache_parent() to loop forever.
Here's what appears to happen:
1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1
2 - CPU1: select_parent(P) locks P->d_lock
3 - CPU0: shrink_dentry_list() locks C->d_lock
dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock
4 - CPU1: select_parent(P) locks C->d_lock,
moves C from dispose list being processed on CPU0 to the new
dispose list, returns 1
5 - CPU0: shrink_dentry_list() finds dispose list empty, returns
6 - Goto 2 with CPU0 and CPU1 switched
Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
it found a new one, causing shrink_dentry_list() to think it's making progress
and loop over and over.
One way to trigger this is to make udev calls stat() on the sysfs file while it
is going away.
Having a file in /lib/udev/rules.d/ with only this one rule seems to the trick:
ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"
Then execute the following loop:
while true; do
echo -bond0 > /sys/class/net/bonding_masters
echo +bond0 > /sys/class/net/bonding_masters
echo -bond1 > /sys/class/net/bonding_masters
echo +bond1 > /sys/class/net/bonding_masters
done
One fix would be to check all callers and prevent concurrent calls to
shrink_dcache_parent(). But I think a better solution is to stop the
stealing behavior.
This patch adds a new dentry flag that is set when the dentry is added to the
dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
new reference just before being pruned.
If the dentry has this flag, select_parent() will skip it and let
shrink_dentry_list() retry pruning it. With select_parent() skipping those
dentries there will not be the appearance of progress (new dentries found) when
there is none, hence shrink_dcache_parent() will not loop forever.
Set the flag is also set in prune_dcache_sb() for consistency as suggested by
Linus.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'kvm-updates/3.3' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (74 commits)
KVM: PPC: Whitespace fix for kvm.h
KVM: Fix whitespace in kvm_para.h
KVM: PPC: annotate kvm_rma_init as __init
KVM: x86 emulator: implement RDPMC (0F 33)
KVM: x86 emulator: fix RDPMC privilege check
KVM: Expose the architectural performance monitoring CPUID leaf
KVM: VMX: Intercept RDPMC
KVM: SVM: Intercept RDPMC
KVM: Add generic RDPMC support
KVM: Expose a version 2 architectural PMU to a guests
KVM: Expose kvm_lapic_local_deliver()
KVM: x86 emulator: Use opcode::execute for Group 9 instruction
KVM: x86 emulator: Use opcode::execute for Group 4/5 instructions
KVM: x86 emulator: Use opcode::execute for Group 1A instruction
KVM: ensure that debugfs entries have been created
KVM: drop bsp_vcpu pointer from kvm struct
KVM: x86: Consolidate PIT legacy test
KVM: x86: Do not rely on implicit inclusions
KVM: Make KVM_INTEL depend on CPU_SUP_INTEL
KVM: Use memdup_user instead of kmalloc/copy_from_user
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: new helper - d_make_root()
dcache: use a dispose list in select_parent
ceph: d_alloc_root() may fail
ext4: fix failure exits
isofs: inode leak on mount failure
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
igmp: Avoid zero delay when receiving odd mixture of IGMP queries
netdev: make net_device_ops const
bcm63xx: make ethtool_ops const
usbnet: make ethtool_ops const
net: Fix build with INET disabled.
net: introduce netif_addr_lock_nested() and call if when appropriate
net: correct lock name in dev_[uc/mc]_sync documentations.
net: sk_update_clone is only used in net/core/sock.c
8139cp: fix missing napi_gro_flush.
pktgen: set correct max and min in pktgen_setup_inject()
smsc911x: Unconditionally include linux/smscphy.h in smsc911x.h
asix: fix infinite loop in rx_fixup()
net: Default UDP and UNIX diag to 'n'.
r6040: fix typo in use of MCR0 register bits
net: fix sock_clone reference mismatch with tcp memcontrol
clock management changes for i.MX
Another simple series related to clock management, this time only for
imx.
* tag 'clk' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: mxs: select HAVE_CLK_PREPARE for clock
clk: add config option HAVE_CLK_PREPARE into Kconfig
ASoC: mxs-saif: convert to clk_prepare/clk_unprepare
video: mxsfb: convert to clk_prepare/clk_unprepare
serial: mxs-auart: convert to clk_prepare/clk_unprepare
net: flexcan: convert to clk_prepare/clk_unprepare
mtd: gpmi-lib: convert to clk_prepare/clk_unprepare
mmc: mxs-mmc: convert to clk_prepare/clk_unprepare
dma: mxs-dma: convert to clk_prepare/clk_unprepare
net: fec: add clk_prepare/clk_unprepare
ARM: mxs: convert platform code to clk_prepare/clk_unprepare
clk: add helper functions clk_prepare_enable and clk_disable_unprepare
Fix up trivial conflicts in drivers/net/ethernet/freescale/fec.c due to
commit 0ebafefcaa ("net: fec: add clk_prepare/clk_unprepare") clashing
trivially with commit e163cc97f9 ("net/fec: fix the .remove code").
Driver specific changes
Again, a lot of platforms have changes in here: pxa, samsung, omap,
at91, imx, ...
* tag 'drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (54 commits)
ARM: sa1100: clean up of the clock support
ARM: pxa: add dummy clock for sa1100-rtc
RTC: sa1100: support sa1100, pxa and mmp soc families
RTC: sa1100: remove redundant code of setting alarm
RTC: sa1100: Clean out ost register
Input: zylonite-wm97xx - replace IRQ_GPIO() with gpio_to_irq()
pcmcia: pxa: replace IRQ_GPIO() with gpio_to_irq()
ARM: EXYNOS: Modified files for SPI consolidation work
ARM: S5P64X0: Enable SDHCI support
ARM: S5P64X0: Add lookup of sdhci-s3c clocks using generic names
ARM: S5P64X0: Add HSMMC setup for host Controller
ARM: EXYNOS: Add USB OHCI support to ORIGEN board
USB: Add Samsung Exynos OHCI diver
ARM: EXYNOS: Add USB OHCI support to SMDKV310 board
ARM: EXYNOS: Add USB OHCI device
net: macb: fix build break with !CONFIG_OF
i2c: tegra: Support DVC controller in device tree
i2c: tegra: Add __devinit/exit to probe/remove
net/at91_ether: use gpio_is_valid for phy IRQ line
ARM: at91/net: add macb ethernet controller in 9g45/9g20 DT
...
New feature development
This adds support for new features, and contains stuff from most
platforms. A number of these patches could have fit into other
branches, too, but were small enough not to cause too much
confusion here.
* tag 'devel' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (28 commits)
mfd/db8500-prcmu: remove support for early silicon revisions
ARM: ux500: fix the smp_twd clock calculation
ARM: ux500: remove support for early silicon revisions
ARM: ux500: update register files
ARM: ux500: register DB5500 PMU dynamically
ARM: ux500: update ASIC detection for U5500
ARM: ux500: support DB8520
ARM: picoxcell: implement watchdog restart
ARM: OMAP3+: hwmod data: Add the default clockactivity for I2C
ARM: OMAP3: hwmod data: disable multiblock reads on MMC1/2 on OMAP34xx/35xx <= ES2.1
ARM: OMAP: USB: EHCI and OHCI hwmod structures for OMAP4
ARM: OMAP: USB: EHCI and OHCI hwmod structures for OMAP3
ARM: OMAP: hwmod data: Add support for AM35xx UART4/ttyO3
ARM: Orion: Remove address map info from all platform data structures
ARM: Orion: Get address map from plat-orion instead of via platform_data
ARM: Orion: mbus_dram_info consolidation
ARM: Orion: Consolidate the address map setup
ARM: Kirkwood: Add configuration for MPP12 as GPIO
ARM: Kirkwood: Recognize A1 revision of 6282 chip
ARM: ux500: update the MOP500 GPIO assignments
...
Device tree conversions for samsung and tegra
Both platforms had some initial device tree support, but this adds
much more to actually make it usable.
* tag 'dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (45 commits)
ARM: dts: Add intial dts file for EXYNOS4210 SoC, SMDKV310 and ORIGEN
ARM: EXYNOS: Add Exynos4 device tree enabled board file
rtc: rtc-s3c: Add device tree support
input: samsung-keypad: Add device tree support
ARM: S5PV210: Modify platform data for pl330 driver
ARM: S5PC100: Modify platform data for pl330 driver
ARM: S5P64x0: Modify platform data for pl330 driver
ARM: EXYNOS: Add a alias for pdma clocks
ARM: EXYNOS: Limit usage of pl330 device instance to non-dt build
ARM: SAMSUNG: Add device tree support for pl330 dma engine wrappers
DMA: PL330: Add device tree support
ARM: EXYNOS: Modify platform data for pl330 driver
DMA: PL330: Infer transfer direction from transfer request instead of platform data
DMA: PL330: move filter function into driver
serial: samsung: Fix build for non-Exynos4210 devices
serial: samsung: add device tree support
serial: samsung: merge probe() function from all SoC specific extensions
serial: samsung: merge all SoC specific port reset functions
ARM: SAMSUNG: register uart clocks to clock lookup list
serial: samsung: remove all uses of get_clksrc and set_clksrc
...
Fix up fairly trivial conflicts in arch/arm/mach-s3c2440/clock.c and
drivers/tty/serial/Kconfig both due to just adding code close to
changes.
Cleanups on various subarchitectures
Cleanup patches for various ARM platforms and some of their associated
drivers, the bulk of these is for mach-91.
Arnd ended up pulling in the restart branch from Russell in order to
fix up some simple but annoying merge conflicts.
* tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (44 commits)
arm/at91: fix build of stamp9g20
ARM: u300: delete memory.h
MAINTAINERS: add maintainer entry for Picochip picoxcell
ARM: picoxcell: move io mappings to common.c
ARM: picoxcell: don't reserve irq_descs
ARM: picoxcell: remove mach/memory.h
ARM: at91: delete the pcontrol_g20_defconfig
arm/tegra: Remove code that's ifndef CONFIG_ARM_GIC
arm/tegra: remove unused defines
arm/tegra: fix variable formatting in makefile
ARM: davinci: vpif: move code to driver core header from platform
ARM: at91/gpio: fix display of number of irq setuped
ARM: at91/gpio: drop PIN_BASE
ARM: at91/udc: use gpio_is_valid to check the gpio
ARM: at91/ohci: use gpio_is_valid to check the gpio
ARM: at91/nand: use gpio_is_valid to check the gpio
ARM: at91/mmc: use gpio_is_valid to check the gpio
ARM: at91/ide: use gpio_is_valid to check the gpio
ARM: at91/pata: use gpio_is_valid to check the gpio
ARM: at91/soc: use gpio_is_valid to check the gpio
...
Including trace/events/*.h TRACE_EVENT() macro headers in other headers
can cause strange side effects if another trace/event/*.h header
includes that header. Having trace/events/kmem.h inside slab_def.h
caused a compile error in sparc64 when changes were done to some header
files. Moving the kmem.h trace header out of slab.h and into slab.c
fixes the problem.
Note, both slub.c and slob.c already include the trace/events/kmem.h
file. Only slab.c had it missing.
Link: http://lkml.kernel.org/r/20120105190405.1e3191fb5a43b2a0f1655e1f@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> net/core/sock.c: In function 'sk_update_clone':
> net/core/sock.c:1278:3: error: implicit declaration of function 'sock_update_memcg'
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: Remove irqsafe_cpu_xxx variants
Fix up conflict in arch/x86/include/asm/percpu.h due to clash with
cebef5beed ("x86: Fix and improve percpu_cmpxchg{8,16}b_double()")
which edited the (now removed) irqsafe_cpu_cmpxchg*_double code.
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cgroup: fix to allow mounting a hierarchy by name
cgroup: move assignement out of condition in cgroup_attach_proc()
cgroup: Remove task_lock() from cgroup_post_fork()
cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
cgroup: only need to check oldcgrp==newgrp once
cgroup: remove redundant get/put of task struct
cgroup: remove redundant get/put of old css_set from migrate
cgroup: Remove unnecessary task_lock before fetching css_set on migration
cgroup: Drop task_lock(parent) on cgroup_fork()
cgroups: remove redundant get/put of css_set from css_set_check_fetched()
resource cgroups: remove bogus cast
cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
cgroup, cpuset: don't use ss->pre_attach()
cgroup: don't use subsys->can_attach_task() or ->attach_task()
cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
cgroup: improve old cgroup handling in cgroup_attach_proc()
cgroup: always lock threadgroup during migration
threadgroup: extend threadgroup_lock() to cover exit and exec
threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
...
Fix up conflict in kernel/cgroup.c due to commit e0197aae59: "cgroups:
fix a css_set not found bug in cgroup_attach_proc" that already
mentioned that the bug is fixed (differently) in Tejun's cgroup
patchset. This one, in other words.
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2/3/4: delete unneeded includes of module.h
ext{3,4}: Fix potential race when setversion ioctl updates inode
udf: Mark LVID buffer as uptodate before marking it dirty
ext3: Don't warn from writepage when readonly inode is spotted after error
jbd: Remove j_barrier mutex
reiserfs: Force inode evictions before umount to avoid crash
reiserfs: Fix quota mount option parsing
udf: Treat symlink component of type 2 as /
udf: Fix deadlock when converting file from in-ICB one to normal one
udf: Cleanup calling convention of inode_getblk()
ext2: Fix error handling on inode bitmap corruption
ext3: Fix error handling on inode bitmap corruption
ext3: replace ll_rw_block with other functions
ext3: NULL dereference in ext3_evict_inode()
jbd: clear revoked flag on buffers before a new transaction started
ext3: call ext3_mark_recovery_complete() when recovery is really needed
dev_uc_sync() and dev_mc_sync() are acquiring netif_addr_lock for
destination device of synchronization. Since netif_addr_lock is
already held at the time for source device, this triggers lockdep
deadlock warning.
There's no way this deadlock can happen so use spin_lock_nested() to
silence the warning.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'staging-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (466 commits)
net/hyperv: Add support for jumbo frame up to 64KB
net/hyperv: Add NETVSP protocol version negotiation
net/hyperv: Remove unnecessary kmap_atomic in netvsc driver
staging/rtl8192e: Register against lib80211
staging/rtl8192e: Convert to lib80211_crypt_info
staging/rtl8192e: Convert to lib80211_crypt_data and lib80211_crypt_ops
staging/rtl8192e: Add lib80211.h to rtllib.h
staging/mei: add watchdog device registration wrappers
drm/omap: GEM, deal with cache
staging: vt6656: int.c, int.h: Change return of function to void
staging: usbip: removed unused definitions from header
staging: usbip: removed dead code from receive function
staging:iio: Drop {mark,unmark}_in_use callbacks
staging:iio: Drop buffer mark_param_change callback
staging:iio: Drop the unused buffer enable() and is_enabled() callbacks
staging:iio: Drop buffer busy flag
staging:iio: Make sure a device is only opened once at a time
staging:iio: Disallow modifying buffer size when buffer is enabled
staging:iio: Disallow changing scan elements in all buffered modes
staging:iio: Use iio_buffer_enabled instead of open coding it
...
Fix up conflict in drivers/staging/iio/adc/ad799x_core.c (removal of
module_init due to using module_i2c_driver() helper, next to removal of
MODULE_ALIAS due to using MODULE_DEVICE_TABLE instead).
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (65 commits)
tty: serial: imx: move del_timer_sync() to avoid potential deadlock
imx: add polled io uart methods
imx: Add save/restore functions for UART control regs
serial/imx: let probing fail for the dt case without a valid alias
serial/imx: propagate error from of_alias_get_id instead of using -ENODEV
tty: serial: imx: Allow UART to be a source for wakeup
serial: driver for m32 arch should not have DEC alpha errata
serial/documentation: fix documented name of DCD cpp symbol
atmel_serial: fix spinlock lockup in RS485 code
tty: Fix memory leak in virtual console when enable unicode translation
serial: use DIV_ROUND_CLOSEST instead of open coding it
serial: add support for 400 and 800 v3 series Titan cards
serial: bfin-uart: Remove ASYNC_CTS_FLOW flag for hardware automatic CTS.
serial: bfin-uart: Enable hardware automatic CTS only when CTS pin is available.
serial: make FSL errata depend on 8250_CONSOLE, not just 8250
serial: add irq handler for Freescale 16550 errata.
serial: manually inline serial8250_handle_port
serial: make 8250 timeout use the specified IRQ handler
serial: export the key functions for an 8250 IRQ handler
serial: clean up parameter passing for 8250 Rx IRQ handling
...
Instead, use the new 'mtd_can_have_bb()', or just rely on 'mtd_block_markbad()'
return code, which will be -EOPNOTSUPP if bad blocks are not supported.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This patch introduces new 'mtd_can_have_bb()' helper function which checks
whether the flash can have bad eraseblocks. Then it changes all the
direct 'mtd->block_isbad' use cases with 'mtd_can_have_bb()'.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Just call the 'mtd_suspend()' and 'mtd_resume()' - they will do nothing
if the operation is not defined.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Instead, call the corresponding MTD API function which will return
'-EOPNOTSUPP' if the operation is not supported.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This patch teaches 'mtd_sync()' to do nothing when the MTD driver does
not have the '->sync()' method, which allows us to remove all direct
'mtd->sync' accesses.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This patch makes the 'mtd_writev()' function more usable and logical. We first
teach it to fall-back to the 'default_mtd_writev()' function if the MTD driver
does not define its own '->writev()' method. Then we make block2mtd and JFFS2
just 'mtd_writev()' instead of 'default_mtd_writev()' function. This means we
can now stop exporting 'default_mtd_writev()' and instead, export
'mtd_writev()'. This is much cleaner and more logical, as well as allows us to
get read of another direct 'mtd->writev' access in JFFS2.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>