IO scheduling and cgroup are tied to the issuing task via io_context
and cgroup of %current. Unfortunately, there are cases where IOs need
to be routed via a different task which makes scheduling and cgroup
limit enforcement applied completely incorrectly.
For example, all bios delayed by blk-throttle end up being issued by a
delayed work item and get assigned the io_context of the worker task
which happens to serve the work item and dumped to the default block
cgroup. This is double confusing as bios which aren't delayed end up
in the correct cgroup and makes using blk-throttle and cfq propio
together impossible.
Any code which punts IO issuing to another task is affected which is
getting more and more common (e.g. btrfs). As both io_context and
cgroup are firmly tied to task including userland visible APIs to
manipulate them, it makes a lot of sense to match up tasks to bios.
This patch implements bio_associate_current() which associates the
specified bio with %current. The bio will record the associated ioc
and blkcg at that point and block layer will use the recorded ones
regardless of which task actually ends up issuing the bio. bio
release puts the associated ioc and blkcg.
It grabs and remembers ioc and blkcg instead of the task itself
because task may already be dead by the time the bio is issued making
ioc and blkcg inaccessible and those are all block layer cares about.
elevator_set_req_fn() is updated such that the bio elvdata is being
allocated for is available to the elevator.
This doesn't update block cgroup policies yet. Further patches will
implement the support.
-v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
rq_ioc() to fix build breakage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks. This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.
This will be used to associate bio's with a given task. This patch
doesn't introduce any visible behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ioc_task_link() is used to share %current's ioc on clone. If
%current->io_context is set, %current is guaranteed to have refcount
on the ioc and, thus, ioc_task_link() can't fail.
Replace error checking in ioc_task_link() with WARN_ON_ONCE() and make
it just increment refcount and nr_tasks.
-v2: Description typo fix (Vivek).
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that blkg additions / removals are always done under both q and
blkcg locks, the only places RCU locking is necessary are
blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops
unncessary RCU locking replacing it with plain blkcg locking as
necessary.
* blkiocg_pre_destroy() already perform proper locking and don't need
RCU. Dropped.
* blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
lock. This isn't a hot path.
* Now unnecessary synchronize_rcu() from queue exit paths removed.
This makes q->nr_blkgs unnecessary. Dropped.
* RCU annotation on blkg->q removed.
-v2: Vivek pointed out that blkg_lookup_create() still needs to be
called under rcu_read_lock(). Updated.
-v3: After the update, stats_lock locking in blkio_read_blkg_stats()
shouldn't be using _irq variant as it otherwise ends up enabling
irq while blkcg->lock is locked. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the previous patch to move blkg list heads and counters to
request_queue and blkg, logic to manage them in both policies are
almost identical and can be moved to blkcg core.
This patch moves blkg link logic into blkg_lookup_create(), implements
common blkg unlink code in blkg_destroy(), and updates
blkg_destory_all() so that it's policy specific and can skip root
group. The updated blkg_destroy_all() is now used to both clear queue
for bypassing and elv switching, and release all blkgs on q exit.
This patch introduces a race window where policy [de]registration may
race against queue blkg clearing. This can only be a problem on cfq
unload and shouldn't be a real problem in practice (and we have many
other places where this race already exists). Future patches will
remove these unlikely races.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, specific policy implementations are responsible for
maintaining list and number of blkgs. This duplicates code
unnecessarily, and hinders factoring common code and providing blkcg
API with better defined semantics.
After this patch, request_queue hosts list heads and counters and blkg
has list nodes for both policies. This patch only relocates the
necessary fields and the next patch will actually move management code
into blkcg core.
Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
have 2 elements. This is to avoid include dependency and will be
removed by the next patch.
This patch doesn't introduce any behavior change.
-v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
as pointed out by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keep track of all request_queues which have blkcg initialized and turn
on bypass and invoke blkcg_clear_queue() on all before making changes
to blkcg policies.
This is to prepare for moving blkg management into blkcg core. Note
that this uses more brute force than necessary. Finer grained shoot
down will be implemented later and given that policy [un]registration
almost never happens on running systems (blk-throtl can't be built as
a module and cfq usually is the builtin default iosched), this
shouldn't be a problem for the time being.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename and extend elv_queisce_start/end() to
blk_queue_bypass_start/end() which are exported and supports nesting
via @q->bypass_depth. Also add blk_queue_bypass() to test bypass
state.
This will be further extended and used for blkio_group management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevator_ops->elevator_init_fn() has a weird return value. It returns
a void * which the caller should assign to q->elevator->elevator_data
and %NULL return denotes init failure.
Update such that it returns integer 0/-errno and sets elevator_data
directly as necessary.
This makes the interface more conventional and eases further cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The SET_PORT functions are implemented in port.c, which is part
of mlx4_core, these functions are exported. The functions are in use by
the mlx4_en module (were originally part of mlx4_en).
Their declaration remained in mlx4_en module, moving the declaration to the right location.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows to check if the other core is in WFI
mode. It is the last check the idle routine has to do before
entering into the retention state.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In the case we go to the retention mode, we decoupled the gic
in order to have the A9 core to reach a stable WFI state.
But we want the prcmu to wake up the A9 when the gic has a pending
irq which is done by copying the gic settings to the to the prcmu.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This patch introduces a routine to check if there are some
irqs pending on the gic. Usually this check is not relevant because
it appears racy (an irq can arrive right after this check), but in
the ux500 it makes sense because the prcmu decouples the gic from
the A9 cores.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The reference group and internal oscillator are shared by sub-devs
like led, backlight and vibrator in PM8606 chip. Now introduce a
voting mechanism to enable/disable it.
Add pm8606_osc_enable() and pm8606_osc_disable() interface and
related defines to support this. This interface will be called by
vibrator led and backlight driver.The refernce group and internal
oscillator are enabled only when at least one of it's clients holds
it on or disabled only all the clients don't use it any more based
on the above mechanism.
Signed-off-by: Jett.Zhou <jtzhou@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Using regmap apis for accessing the device registers and
using RBTREE caching mechanims for caching registers.
Enabling caching of the registers which is used for voltage
controls. By doing this, the modify_bits operation is faster as
it does not involve the i2c register read from device, just read
from cache. This results faster set voltage operation.
Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com>
Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
For 88pm860x pmic, it can wake the system from low power mode by irq,
its sub-devs like RTC and onkey can be enabled for this usage.
Signed-off-by: Jett.Zhou <jtzhou@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
da9052 has been converted to use regmap API, so we can remove the unused
io_lock mutex.
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This patch allows to decouple and recouple the gic from the PRCMU.
This is needed to put the A9 core in retention mode with the cpuidle
driver.
It is based on top of the "DB8500 PRCMU update" patchset.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Rickard Andersson <rickard.andersson@stericsson.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
There are currently four different versions of the AB8500
around: AB8500, AB8505, AB9540 and AB8540. Unfortunately:
- Some of the chips (AB8500, AB8505, AB9540) cannot read
the AB8500_REV_REG register but return errors
- Some of them have the same ID value in the hardware
register AB8500_REV_REV, for example the first versions
of AB8505 and AB9540 have 0xFF in this register -
just like the AB8500.
So we need to be able to enforce a certain version from
the platform. We do this by using the id of the platform
device that provides the read/write functions.
Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Maxime Coquelin <maxime.coquelin@stericsson.com>
Signed-off-by: Alex Macro <alex.macro@stericsson.com>
Signed-off-by: Michel Jaouen <michel.jaouen@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This patch adds an initial PRCMU register access API, which
for now should only be used for a very limited set of registers.
The idea about this API is that we split the PRCMU driver in
one part that deals with interaction with the PRCMU firmware
and one part that simply provide write accessors in the PRCMU
register range. The latter are just a collection of registers
exposed in the PRCMU register range for various purposes and
not related to the PRCMU firmware.
Currently we support some limited GPIO, SPI and UART settings
through this API.
Signed-off-by: Mattias Nilsson <mattias.i.nilsson@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This updates the clock handling in the DB8500 PRCMU driver with
the latest findings and API changes related to changes in the
backing firmware in the PRCMU.
- Add the necessary interfaces to get the frequencies of the
clocks and set the rate of some of the clocks.
- Add support for controlling the clocks PLLSOC0, PLLDSI,
DSI0, DSI1 and DSI escape clocks (DSInESCCLK).
- Correct the PLLSDI enable/disable sequence by using the
DSIPLL_CLAMPI bit.
After this we will have the interfaces and code to implement the
U8500 clock framework properly.
Reviewed-by: Jonas Aberg <jonas.aberg@stericsson.com>
Signed-off-by: Mattias Nilsson <mattias.i.nilsson@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This prefixes a number of accessor functions with db8500_* since
they are DB8500-specific and we need to move to this naming
scheme.
We also replace numerous instances of machine_is() with cpu_is()
which covers the right type of ASICs rather than entire machines
i.e. boards.
Signed-off-by: Mattias Nilsson <mattias.i.nilsson@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
MC13783 can be programmed to wait some clock cycles between the
touchscreen polarization and the resistance conversion. This is
needed to adjust for touchscreens with high capacitance between
plates.
Signed-off-by: Michael Thalmeier <michael.thalmeier@hale.at>
Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The TPS65217 chip is a power management IC for Portable Navigation Systems
and Tablet Computing devices. It contains the following components:
- Regulators
- White LED
- USB battery charger
This patch adds support for tps65217 mfd device. At this time only
the regulator functionality is made available.
Signed-off-by: AnilKumar Ch <anilkumar@ti.com>
Reviwed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The resource table is an array of 'struct fw_resource' members, where
each resource entry is expressed as a single member of that array.
This approach got us this far, but it has a few drawbacks:
1. Different resource entries end up overloading the same members of 'struct
fw_resource' with different meanings. The resulting code is error prone
and hard to read and maintain.
2. It's impossible to extend 'struct fw_resource' without breaking the
existing firmware images (and we already want to: we can't introduce the
new virito device resource entry with the current scheme).
3. It doesn't scale: 'struct fw_resource' must be as big as the largest
resource entry type. As a result, smaller resource entries end up
utilizing only small part of it.
This is fixed by defining a dedicated structure for every resource type,
and then converting the resource table to a list of type-value members.
Instead of a rigid array of homogeneous structs, the resource table
is turned into a collection of heterogeneous structures.
This way:
1. Resource entries consume exactly the amount of bytes they need.
2. It's easy to extend: just create a new resource entry structure, and assign
it a new type.
3. The code is easier to read and maintain: the structures' members names are
meaningful.
While we're at it, this patch has several other resource table changes:
1. The resource table gains a simple header which contains the
number of entries in the table and their offsets within the table. This
makes the parsing code simpler and easier to read.
2. A version member is added to the resource table. Should we change the
format again, we'll bump up this version to prevent breakage with
existing firmware images.
3. The VRING and VIRTIO_DEV resource entries are combined to a single
VDEV entry. This paves the way to supporting multiple VDEV entries.
4. Since we don't really support 64-bit rprocs yet, convert two stray u64
members to u32.
Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
Cc: Brian Swetland <swetland@google.com>
Cc: Iliyan Malchev <malchev@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mark Grosen <mgrosen@ti.com>
Cc: John Williams <john.williams@petalogix.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Loic PALLARDY <loic.pallardy@stericsson.com>
Cc: Ludovic BARRE <ludovic.barre@stericsson.com>
Cc: Omar Ramirez Luna <omar.luna@linaro.org>
Cc: Guzman Lugo Fernando <fernando.lugo@ti.com>
Cc: Anna Suman <s-anna@ti.com>
Cc: Clark Rob <rob@ti.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Saravana Kannan <skannan@codeaurora.org>
Cc: David Brown <davidb@codeaurora.org>
Cc: Kieran Bingham <kieranbingham@gmail.com>
Cc: Tony Lindgren <tony@atomide.com>
Replace the union with the common struct stateid4 as defined in both
RFC3530 and RFC5661. This makes it easier to access the sequence id,
which will again make implementing support for parallel OPEN calls
easier.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Conflicts:
drivers/net/vmxnet3/vmxnet3_drv.c
Small vmxnet3 conflict with header size bug fix in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge the emailed seties of 19 patches from Andrew Morton
* akpm:
rapidio/tsi721: fix queue wrapping bug in inbound doorbell handler
memcg: fix mapcount check in move charge code for anonymous page
mm: thp: fix BUG on mm->nr_ptes
alpha: fix 32/64-bit bug in futex support
memcg: fix GPF when cgroup removal races with last exit
debugobjects: Fix selftest for static warnings
floppy/scsi: fix setting of BIO flags
memcg: fix deadlock by inverting lrucare nesting
drivers/rtc/rtc-r9701.c: fix crash in r9701_remove()
c2port: class_create() returns an ERR_PTR
pps: class_create() returns an ERR_PTR, not NULL
hung_task: fix the broken rcu_lock_break() logic
vfork: kill PF_STARTING
coredump_wait: don't call complete_vfork_done()
vfork: make it killable
vfork: introduce complete_vfork_done()
aio: wake up waiters when freeing unused kiocbs
kprobes: return proper error code from register_kprobe()
kmsg_dump: don't run on non-error paths by default
When moving tasks from old memcg (with move_charge_at_immigrate on new
memcg), followed by removal of old memcg, hit General Protection Fault in
mem_cgroup_lru_del_list() (called from release_pages called from
free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
exit_mmap from mmput from exit_mm from do_exit).
Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
still trying to update its stats, and take page off lru before freeing.
A task, or a charge, or a page on lru: each secures a memcg against
removal. In this case, the last task has been moved out of the old memcg,
and it is exiting: anonymous pages are uncharged one by one from the
memcg, as they are zapped from its pagetables, so the charge gets down to
0; but the pages themselves are queued in an mmu_gather for freeing.
Most of those pages will be on lru (and force_empty is careful to
lru_add_drain_all, to add pages from pagevec to lru first), but not
necessarily all: perhaps some have been isolated for page reclaim, perhaps
some isolated for other reasons. So, force_empty may find no task, no
charge and no page on lru, and let the removal proceed.
There would still be no problem if these pages were immediately freed; but
typically (and the put_page_testzero protocol demands it) they have to be
added back to lru before they are found freeable, then removed from lru
and freed. We don't see the issue when adding, because the
mem_cgroup_iter() loops keep their own reference to the memcg being
scanned; but when it comes to mem_cgroup_lru_del_list().
I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
38c5d72f3e ("memcg: simplify LRU handling by new rule") mercifully
removed those convolutions, but left this General Protection Fault.
But it's surprisingly easy to restore the old behaviour: just check
PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
change? just going back to how it worked before; testing, and an audit of
uses of pc->mem_cgroup, show no problem.
And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
longer has any purpose, and we can safely revert 4e5f01c2b9 ("memcg:
clear pc->mem_cgroup if necessary").
Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
is not strictly necessary: the lru_lock there, with RCU before memcg
structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
without that; but it seems cleaner to rely on one dependency less.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously it was (ab)used by utrace. Then it was wrongly used by the
scheduler code.
Currently it is not used, kill it before it finds the new erroneous user.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done(). zap_threads() should find and kill all tasks with
the same ->mm, this includes our parent if ->vfork_done is set.
mm_release() becomes the only caller, unexport complete_vfork_done().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make vfork() killable.
Change do_fork(CLONE_VFORK) to do wait_for_completion_killable(). If it
fails we do not return to the user-mode and never touch the memory shared
with our child.
However, in this case we should clear child->vfork_done before return, we
use task_lock() in do_fork()->wait_for_vfork_done() and
complete_vfork_done() to serialize with each other.
Note: now that we use task_lock() we don't really need completion, we
could turn task->vfork_done into "task_struct *wake_up_me" but this needs
some complications.
NOTE: this and the next patches do not affect in-kernel users of
CLONE_VFORK, kernel threads run with all signals ignored including
SIGKILL/SIGSTOP.
However this is obviously the user-visible change. Not only a fatal
signal can kill the vforking parent, a sub-thread can do execve or
exit_group() and kill the thread sleeping in vfork().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>