Using a helper instead of open code makes thaw_workqueues() clearer.
The helper will also be used by the next patch.
tj: Slight update to comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The existing work_on_cpu() implementation is hugely inefficient. It
creates a new kthread, execute that single function and then let the
kthread die on each invocation.
Now that system_wq can handle concurrent executions, there's no
advantage of doing this. Reimplement work_on_cpu() using system_wq
which makes it simpler and way more efficient.
stable: While this isn't a fix in itself, it's needed to fix a
workqueue related bug in cpufreq/powernow-k8. AFAICS, this
shouldn't break other existing users.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
@delayed is now always false for all callers, remove it.
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, when try_to_grab_pending() grabs a delayed work item, it
leaves its linked work items alone on the delayed_works. The linked
work items are always NO_COLOR and will cause future
cwq_activate_first_delayed() increase cwq->nr_active incorrectly, and
may cause the whole cwq to stall. For example,
state: cwq->max_active = 1, cwq->nr_active = 1
one work in cwq->pool, many in cwq->delayed_works.
step1: try_to_grab_pending() removes a work item from delayed_works
but leaves its NO_COLOR linked work items on it.
step2: Later on, cwq_activate_first_delayed() activates the linked
work item increasing ->nr_active.
step3: cwq->nr_active = 1, but all activated work items of the cwq are
NO_COLOR. When they finish, cwq->nr_active will not be
decreased due to NO_COLOR, and no further work items will be
activated from cwq->delayed_works. the cwq stalls.
Fix it by ensuring the target work item is activated before stealing
PENDING in try_to_grab_pending(). This ensures that all the linked
work items are activated without incorrectly bumping cwq->nr_active.
tj: Updated comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so
hotcpu_notifier() fits better than cpu_notifier().
When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same.
When HOTPLUG_CPU=n, if we use cpu_notifier(),
workqueue_cpu_down_callback() will be called during boot to do
nothing, and the memory of workqueue_cpu_down_callback() and
gcwq_unbind_fn() will be discarded after boot.
If we use hotcpu_notifier(), we can avoid the no-op call of
workqueue_cpu_down_callback() and the memory of
workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discard at
build time:
$ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
-rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier
-rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier
$ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
text data bss dec hex filename
18513 2387 1221 22121 5669 kernel/workqueue.o.cpu_notifier
18082 2355 1221 21658 549a kernel/workqueue.o.hotcpu_notifier
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
For workqueue hotplug callbacks, it makes less sense to use __devinit
which discards the memory after boot if !HOTPLUG. __cpuinit, which
discards the memory after boot if !HOTPLUG_CPU fits better.
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that manager_mutex's role has changed from synchronizing manager
role to excluding hotplug against manager, the name is misleading.
As it is protecting the CPU-association of the gcwq now, rename it to
assoc_mutex.
This patch is pure rename and doesn't introduce any functional change.
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now both worker destruction and idle rebinding remove the worker from
idle list while it's still idle, so list_empty(&worker->entry) can be
used to test whether either is pending and WORKER_DIE to distinguish
between the two instead making WORKER_REBIND unnecessary.
Use list_empty(&worker->entry) to determine whether destruction or
rebinding is pending. This simplifies worker state transitions.
WORKER_REBIND is not needed anymore. Remove it.
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Because the old unbind/rebinding implementation wasn't atomic w.r.t.
GCWQ_DISASSOCIATED manipulation which is protected by
global_cwq->lock, we had to use two flags, WORKER_UNBOUND and
WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with
back-to-back CPU hotplug operations; otherwise, completion of
rebinding while another unbinding is in progress could clear UNBIND
prematurely.
Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED,
there's no need to use two flags. Just one is enough. Don't use
WORKER_REBIND for busy rebinding.
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently rebind_workers() uses rebinds idle workers synchronously
before proceeding to requesting busy workers to rebind. This is
necessary because all workers on @worker_pool->idle_list must be bound
before concurrency management local wake-ups from the busy workers
take place.
Unfortunately, the synchronous idle rebinding is quite complicated.
This patch reimplements idle rebinding to simplify the code path.
Rather than trying to make all idle workers bound before rebinding
busy workers, we simply remove all to-be-bound idle workers from the
idle list and let them add themselves back after completing rebinding
(successful or not).
As only workers which finished rebinding can on on the idle worker
list, the idle worker list is guaranteed to have only bound workers
unless CPU went down again and local wake-ups are safe.
After the change, @worker_pool->nr_idle may deviate than the actual
number of idle workers on @worker_pool->idle_list. More specifically,
nr_idle may be non-zero while ->idle_list is empty. All users of
->nr_idle and ->idle_list are audited. The only affected one is
too_many_workers() which is updated to check %false if ->idle_list is
empty regardless of ->nr_idle.
After this patch, rebind_workers() no longer performs the nasty
idle-rebind retries which require temporary release of gcwq->lock, and
both unbinding and rebinding are atomic w.r.t. global_cwq->lock.
worker->idle_rebind and global_cwq->rebind_hold are now unnecessary
and removed along with the definition of struct idle_rebind.
Changed from V1:
1) remove unlikely from too_many_workers(), ->idle_list can be empty
anytime, even before this patch, no reason to use unlikely.
2) fix a small rebasing mistake.
(which is from rebasing the orignal fixing patch to for-next)
3) add a lot of comments.
4) clear WORKER_REBIND unconditionaly in idle_worker_rebind()
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Implement kprojid_t a cousin of the kuid_t and kgid_t.
The per user namespace mapping of project id values can be set with
/proc/<pid>/projid_map.
A full compliment of helpers is provided: make_kprojid, from_kprojid,
from_kprojid_munged, kporjid_has_mapping, projid_valid, projid_eq,
projid_eq, projid_lt.
Project identifiers are part of the generic disk quota interface,
although it appears only xfs implements project identifiers currently.
The xfs code allows anyone who has permission to set the project
identifier on a file to use any project identifier so when
setting up the user namespace project identifier mappings I do
not require a capability.
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- When tracing capture the kuid.
- When displaying the data to user space convert the kuid into the
user namespace of the process that opened the report file.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
BSD process accounting conveniently passes the file the accounting
records will be written into to do_acct_process. The file credentials
captured the user namespace of the opener of the file. Use the file
credentials to format the uid and the gid of the current process into
the user namespace of the user that started the bsd process
accounting.
Cc: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Explicitly limit exit task stat broadcast to the initial user and
pid namespaces, as it is already limited to the initial network
namespace.
- For broadcast task stats explicitly generate all of the idenitiers
in terms of the initial user namespace and the initial pid
namespace.
- For request stats report them in terms of the current user namespace
and the current pid namespace. Netlink messages are delivered
syncrhonously to the kernel allowing us to get the user namespace
and the pid namespace from the current task.
- Pass the namespaces for representing pids and uids and gids
into bacct_add_task.
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Explicitly format uids gids in audit messges in the initial user
namespace. This is safe because auditd is restrected to be in
the initial user namespace.
- Convert audit_sig_uid into a kuid_t.
- Enable building the audit code and user namespaces at the same time.
The net result is that the audit subsystem now uses kuid_t and kgid_t whenever
possible making it almost impossible to confuse a raw uid_t with a kuid_t
preventing bugs.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Always store audit loginuids in type kuid_t.
Print loginuids by converting them into uids in the appropriate user
namespace, and then printing the resulting uid.
Modify audit_get_loginuid to return a kuid_t.
Modify audit_set_loginuid to take a kuid_t.
Modify /proc/<pid>/loginuid on read to convert the loginuid into the
user namespace of the opener of the file.
Modify /proc/<pid>/loginud on write to convert the loginuid
rom the user namespace of the opener of the file.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Paul Moore <paul@paul-moore.com> ?
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The audit filter code guarantees that uid are always compared with
uids and gids are always compared with gids, as the comparason
operations are type specific. Take advantage of this proper to define
audit_uid_comparator and audit_gid_comparator which use the type safe
comparasons from uidgid.h.
Build on audit_uid_comparator and audit_gid_comparator and replace
audit_compare_id with audit_compare_uid and audit_compare_gid. This
is one of those odd cases where being type safe and duplicating code
leads to simpler shorter and more concise code.
Don't allow bitmask operations in uid and gid comparisons in
audit_data_to_entry. Bitmask operations are already denined in
audit_rule_to_entry.
Convert constants in audit_rule_to_entry and audit_data_to_entry into
kuids and kgids when appropriate.
Convert the uid and gid field in struct audit_names to be of type
kuid_t and kgid_t respectively, so that the new uid and gid comparators
can be applied in a type safe manner.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The only place we use the uid and the pid that we calculate in
audit_receive_msg is in audit_log_common_recv_msg so move the
calculation of these values into the audit_log_common_recv_msg.
Simplify the calcuation of the current pid and uid by
reading them from current instead of reading them from
NETLINK_CREDS.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
For user generated audit messages set the portid field in the netlink
header to the netlink port where the user generated audit message came
from. Reporting the process id in a port id field was just nonsense.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use current instead of looking up the current up the current task by
process identifier. Netlink requests are processed in trhe context of
the sending task so this is safe.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that netlink messages are processed in the context of the sender
tty_audit_push_task can be called directly and audit_prepare_user_tty
which only added looking up the task of the tty by process id is
not needed.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Get caller process uid and gid and pid values from the current task
instead of the NETLINK_CB. This is simpler than passing NETLINK_CREDS
from from audit_receive_msg to audit_filter_user_rules and avoid the
chance of being hit by the occassional bugs in netlink uid/gid
credential passing. This is a safe changes because all netlink
requests are processed in the task of the sending process.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This allows the code to safely make the assumption that all of the
uids gids and pids that need to be send in audit messages are in the
initial namespaces.
If someone cares we may lift this restriction someday but start with
limiting access so at least the code is always correct.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This merge is necessary as Lai's CPU hotplug restructuring series
depends on the CPU hotplug bug fixes in for-3.6-fixes.
The merge creates one trivial conflict between the following two
commits.
96e65306b8 "workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic"
e2b6a6d570 "workqueue: use system_highpri_wq for highpri workers in rebind_workers()"
Both add local variable definitions to the same block and can be
merged in any order.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull another workqueue fix from Tejun Heo:
"Unfortunately, yet another late fix. This too is discovered and fixed
by Lai. This bug was introduced during this merge window by commit
25511a4776 ("workqueue: reimplement CPU online rebinding to handle
idle workers") which started using WORKER_REBIND flag for idle rebind
too.
The bug is relatively easy to trigger if the CPU rapidly goes through
off, on and then off (and stay off). The fix is on the safer side.
This hasn't been on linux-next yet but I'm pushing early so that it
can get more exposure before v3.6 release."
* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
The kernel doesn't check the pid for negative values, so if you try to
write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic.
The crash happens because the next pid is -1, and alloc_pidmap() will
try to access to a nonexistent pidmap.
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pm-sleep:
properly __init-annotate pm_sysrq_init()
PM / wakeup: Use irqsave/irqrestore for events_lock
PM / Freezer: Fix small typo "regrigerator"
PM / Sleep: Print name of wakeup source that aborts suspend
* pm-timers:
PM: Do not use the syscore flag for runtime PM
sh: MTU2: Basic runtime PM support
sh: CMT: Basic runtime PM support
sh: TMU: Basic runtime PM support
PM / Domains: Do not measure start time for "irq safe" devices
PM / Domains: Move syscore flag from subsys data to struct device
PM / Domains: Rename the always_on device flag to syscore
PM / Runtime: Allow helpers to be called by early platform drivers
PM: Reorganize device PM initialization
sh: MTU2: Introduce clock events suspend/resume routines
sh: CMT: Introduce clocksource/clock events suspend/resume routines
sh: TMU: Introduce clocksource/clock events suspend/resume routines
timekeeping: Add suspend and resume of clock event devices
PM / Domains: Add power off/on function for system core suspend stage
PM / Domains: Introduce simplified power on routine for system resume
This patch allows setting of the show_unhandled_signals variable via
/proc/sys/debug/exception-trace. The default value is currently 1
showing unhandled user faults (undefined instructions, data aborts) and
invalid signal stack frames.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
This reverts commit 970e178985.
Nikolay Ulyanitsky reported thatthe 3.6-rc5 kernel has a 15-20%
performance drop on PostgreSQL 9.2 on his machine (running "pgbench").
Borislav Petkov was able to reproduce this, and bisected it to this
commit 970e178985 ("sched: Improve scalability via 'CPU buddies' ...")
apparently because the new single-idle-buddy model simply doesn't find
idle CPU's to reschedule on aggressively enough.
Mike Galbraith suspects that it is likely due to the user-mode spinlocks
in PostgreSQL not reacting well to preemption, but we don't really know
the details - I'll just revert the commit for now.
There are hopefully other approaches to improve scheduler scalability
without it causing these kinds of downsides.
Reported-by: Nikolay Ulyanitsky <lystor@gmail.com>
Bisected-by: Borislav Petkov <bp@alien8.de>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
net/netfilter/nfnetlink_log.c
net/netfilter/xt_LOG.c
Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.
Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.
Signed-off-by: David S. Miller <davem@davemloft.net>
As Oleg pointed out in [0] uprobe should not use the ptrace interface
for enabling/disabling single stepping.
[0] http://lkml.kernel.org/r/20120730141638.GA5306@redhat.com
Add the new "__weak arch" helpers which simply call user_*_single_step()
as a preparation. This is only needed to not break the powerpc port, we
will fold this logic into arch_uprobe_pre/post_xol() hooks later.
We should also change handle_singlestep(), _disable_step(&uprobe->arch)
should be called before put_uprobe().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
The wrong MMF_HAS_UPROBES doesn't really hurt, just it triggers
the "slow" and unnecessary handle_swbp() path if the task hits
the non-uprobe breakpoint.
So this patch changes find_active_uprobe() to check every valid
vma and clear MMF_HAS_UPROBES if no uprobes were found. This is
adds the slow O(n) path, but it is only called in unlikely case
when the task hits the normal breakpoint first time after
uprobe_unregister().
Note the "not strictly accurate" comment in mmf_recalc_uprobes().
We can fix this, we only need to teach vma_has_uprobes() to return
a bit more more info, but I am not sure this worth the trouble.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Add the new MMF_RECALC_UPROBES flag, it means that MMF_HAS_UPROBES
can be false positive after remove_breakpoint() or uprobe_munmap().
It is also set by uprobe_dup_mmap(), this is not optimal but simple.
We could add the new hook, uprobe_dup_vma(), to set MMF_HAS_UPROBES
only if the new mm actually has uprobes, but I don't think this
makes sense.
The next patch will use this flag to clear MMF_HAS_UPROBES.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Nobody plays with uprobes_tree/uprobes_treelock in interrupt context,
no need to disable irqs.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Pull perf fixes from Ingo Molnar:
"This tree includes various fixes"
Ingo really needs to improve on the whole "explain git pull" part.
"Various fixes" indeed.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled
perf/x86: Enable Intel Cedarview Atom suppport
perf_event: Switch to internal refcount, fix race with close()
oprofile, s390: Fix uninitialized memory access when writing to oprofilefs
perf/x86: Fix microcode revision check for SNB-PEBS
Currently, cgroup hierarchy support is a mess. cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children. blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup. Others show yet different behaviors.
These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.
Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy. Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.
This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support. The goal of
this patch is two-fold.
* Move users away from using hierarchy on currently non-hierarchical
subsystems, so that implementing proper hierarchy support on those
doesn't surprise them.
* Keep track of which controllers are broken how and nudge the
subsystems to implement proper hierarchy support.
For now, start with a single warning message. We can whine louder
later on.
v2: Fixed a typo spotted by Michal. Warning message updated.
v3: Updated memcg part so that it doesn't generate warning in the
cases where .use_hierarchy=false doesn't make the behavior
different from root.use_hierarchy=true. Fixed a typo spotted by
Glauber.
v4: Check ->broken_hierarchy after cgroup creation is complete so that
->create() can affect the result per Michal. Dropped unnecessary
memcg root handling per Michal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
WARNING: With this change it is impossible to load external built
controllers anymore.
In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
set, corresponding subsys_id should also be a constant. Up to now,
net_prio_subsys_id and net_cls_subsys_id would be of the type int and
the value would be assigned during runtime.
By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_id will have constant value. That means we
need to remove all the code which assumes a value can be assigned to
net_prio_subsys_id and net_cls_subsys_id.
A close look is necessary on the RCU part which was introduces by
following patch:
commit f845172531
Author: Herbert Xu <herbert@gondor.apana.org.au> Mon May 24 09:12:34 2010
Committer: David S. Miller <davem@davemloft.net> Mon May 24 09:12:34 2010
cls_cgroup: Store classid in struct sock
Tis code was added to init_cgroup_cls()
/* We can't use rcu_assign_pointer because this is an int. */
smp_wmb();
net_cls_subsys_id = net_cls_subsys.subsys_id;
respectively to exit_cgroup_cls()
net_cls_subsys_id = -1;
synchronize_rcu();
and in module version of task_cls_classid()
rcu_read_lock();
id = rcu_dereference(net_cls_subsys_id);
if (id >= 0)
classid = container_of(task_subsys_state(p, id),
struct cgroup_cls_state, css)->classid;
rcu_read_unlock();
Without an explicit explaination why the RCU part is needed. (The
rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
in a later commit, but that is a minor detail.)
So here is my pondering why it was introduced and why it safe to
remove it now. Note that this code was copied over to net_prio the
reasoning holds for that subsystem too.
The idea behind the RCU use for net_cls_subsys_id is to make sure we
get a valid pointer back from task_subsys_state(). task_subsys_state()
is just blindly accessing the subsys array and returning the
pointer. Obviously, passing in -1 as id into task_subsys_state()
returns an invalid value (out of lower bound).
So this code makes sure that only after module is loaded and the
subsystem registered, the id is assigned.
Before unregistering the module all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronized_rcu(). Any new readers wont call task_subsys_state()
anymore and therefore it is safe to unregister the subsystem.
The new code relies on the same trick, but it looks at the subsys
pointer return by task_subsys_state() (remember the id is constant
and therefore we allways have a valid index into the subsys
array).
No precautions need to be taken during module loading
module. Eventually, all CPUs will get a valid pointer back from
task_subsys_state() because rebind_subsystem() which is called after
the module init() function will assigned subsys[net_cls_subsys_id] the
newly loaded module subsystem pointer.
When the subsystem is about to be removed, rebind_subsystem() will
called before the module exit() function. In this case,
rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
and then it calls synchronize_rcu(). All old readers have left by then
the critical section. Any new reader wont access the subsystem
anymore. At this point we are safe to unregister the subsystem. No
synchronize_rcu() call is needed.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Before we are able to define all subsystem ids at compile time we need
a more fine grained control what gets defined when we include
cgroup_subsys.h. For example we define the enums for the subsystems or
to declare for struct cgroup_subsys (builtin subsystem) by including
cgroup_subsys.h and defining SUBSYS accordingly.
Currently, the decision if a subsys is used is defined inside the
header by testing if CONFIG_*=y is true. By moving this test outside
of cgroup_subsys.h we are able to control it on the include level.
This is done by introducing IS_SUBSYS_ENABLED which then is defined
according the task, e.g. is CONFIG_*=y or CONFIG_*=m.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
looping over the subsys array looking either at the builtin or the
module subsystems. Since all the builtin subsystems have an id which
is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
are sorted.
We are about to change id assignment to happen only at compile time
later in this series. That means we can't rely on the above trick
since all ids will always be defined at compile time. Furthermore,
ordering the builtin subsystems and the module subsystems is not
really necessary.
So we need a different way to know which subsystem is a builtin or a
module one. We can use the subsys[]->module pointer for this. Any
place where we need to know if a subsys is module we just check for
the pointer. If it is NULL then the subsystem is a builtin one.
With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
enum. Though we need to introduce a temporary placeholder so that we
don't get a compilation error when only CONFIG_CGROUP is selected and
no single controller. An empty enum definition is not valid. Later in
this series we are able to remove the placeholder again.
And with this change we get a fix for this:
kernel/cgroup.c: In function ‘cgroup_load_subsys’:
kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]
when CONFIG_CGROUP=y and no built in controller was enabled.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Fix kprobes/x86 to support jprobes on ftrace-based kprobes.
Because of -mfentry support of ftrace, ftrace is now put
on the beginning of function where jprobes are put.
Originally ftrace-based kprobes doesn't support jprobe
because it will change regs->ip and ftrace doesn't support
changing IP and ftrace itself doesn't conflict jprobe.
However, ftrace -mfentry support moves mcount call on the
top of functions where jprobes are put. This means that
jprobe always conflicts with ftrace-based kprobe and fails.
This patch allows ftrace-based kprobes to support jprobes
by allowing to modify regs->ip and kprobes breakpoint
handler also allows to skip singlestepping because there
is a ftrace call (not an original instruction).
Link: http://lkml.kernel.org/r/20120905143125.10329.90836.stgit@localhost.localdomain
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>