android_kernel_xiaomi_sm8450

xiaomi-sm8450/android_kernel_xiaomi_sm8450

Author	SHA1	Message	Date
Michael Wang	62470419e9	sched: Implement smarter wake-affine logic The wake-affine scheduler feature is currently always trying to pull the wakee close to the waker. In theory this should be beneficial if the waker's CPU caches hot data for the wakee, and it's also beneficial in the extreme ping-pong high context switch rate case. Testing shows it can benefit hackbench up to 15%. However, the feature is somewhat blind, from which some workloads such as pgbench suffer. It's also time-consuming algorithmically. Testing shows it can damage pgbench up to 50% - far more than the benefit it brings in the best case. So wake-affine should be smarter and it should realize when to stop its thankless effort at trying to find a suitable CPU to wake on. This patch introduces 'wakee_flips', which will be increased each time the task flips (switches) its wakee target. So a high 'wakee_flips' value means the task has more than one wakee, and the bigger the number, the higher the wakeup frequency. Now when making the decision on whether to pull or not, pay attention to the wakee with a high 'wakee_flips', pulling such a task may benefit the wakee. Also imply that the waker will face cruel competition later, it could be very cruel or very fast depends on the story behind 'wakee_flips', waker therefore suffers. Furthermore, if waker also has a high 'wakee_flips', that implies that multiple tasks rely on it, then waker's higher latency will damage all of them, so pulling wakee seems to be a bad deal. Thus, when 'waker->wakee_flips / wakee->wakee_flips' becomes higher and higher, the cost of pulling seems to be worse and worse. The patch therefore helps the wake-affine feature to stop its pulling work when: wakee->wakee_flips > factor && waker->wakee_flips > (factor * wakee->wakee_flips) The 'factor' here is the number of CPUs in the current CPU's NUMA node, so a bigger node will lead to more pulling since the trial becomes more severe. After applying the patch, pgbench shows up to 40% improvements and no regressions. Tested with 12 cpu x86 server and tip 3.10.0-rc7. The percentages in the final column highlight the areas with the biggest wins, all other areas improved as well: pgbench base smart \| db_size \| clients \| tps \| \| tps \| +---------+---------+-------+ +-------+ \| 22 MB \| 1 \| 10598 \| \| 10796 \| \| 22 MB \| 2 \| 21257 \| \| 21336 \| \| 22 MB \| 4 \| 41386 \| \| 41622 \| \| 22 MB \| 8 \| 51253 \| \| 57932 \| \| 22 MB \| 12 \| 48570 \| \| 54000 \| \| 22 MB \| 16 \| 46748 \| \| 55982 \| +19.75% \| 22 MB \| 24 \| 44346 \| \| 55847 \| +25.93% \| 22 MB \| 32 \| 43460 \| \| 54614 \| +25.66% \| 7484 MB \| 1 \| 8951 \| \| 9193 \| \| 7484 MB \| 2 \| 19233 \| \| 19240 \| \| 7484 MB \| 4 \| 37239 \| \| 37302 \| \| 7484 MB \| 8 \| 46087 \| \| 50018 \| \| 7484 MB \| 12 \| 42054 \| \| 48763 \| \| 7484 MB \| 16 \| 40765 \| \| 51633 \| +26.66% \| 7484 MB \| 24 \| 37651 \| \| 52377 \| +39.11% \| 7484 MB \| 32 \| 37056 \| \| 51108 \| +37.92% \| 15 GB \| 1 \| 8845 \| \| 9104 \| \| 15 GB \| 2 \| 19094 \| \| 19162 \| \| 15 GB \| 4 \| 36979 \| \| 36983 \| \| 15 GB \| 8 \| 46087 \| \| 49977 \| \| 15 GB \| 12 \| 41901 \| \| 48591 \| \| 15 GB \| 16 \| 40147 \| \| 50651 \| +26.16% \| 15 GB \| 24 \| 37250 \| \| 52365 \| +40.58% \| 15 GB \| 32 \| 36470 \| \| 50015 \| +37.14% Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com [ Improved the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-23 12:18:41 +02:00
Vladimir Davydov	685207963b	sched: Move h_load calculation to task_h_load() The bad thing about update_h_load(), which computes hierarchical load factor for task groups, is that it is called for each task group in the system before every load balancer run, and since rebalance can be triggered very often, this function can eat really a lot of cpu time if there are many cpu cgroups in the system. Although the situation was improved significantly by commit `a35b646` ('sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies'), the problem still can arise under some kinds of loads, e.g. when cpus are switching from idle to busy and back very frequently. For instance, when I start 1000 of processes that wake up every millisecond on my 8 cpus host, 'top' and 'perf top' show: Cpu(s): 17.8%us, 24.3%sy, 0.0%ni, 57.9%id, 0.0%wa, 0.0%hi, 0.0%si Events: 243K cycles 7.57% [kernel] [k] __schedule 7.08% [kernel] [k] timerqueue_add 6.13% libc-2.12.so [.] usleep Then if I create 10000 idle cpu cgroups (no processes in them), cpu usage increases significantly although the 'wakers' are still executing in the root cpu cgroup: Cpu(s): 19.1%us, 48.7%sy, 0.0%ni, 31.6%id, 0.0%wa, 0.0%hi, 0.7%si Events: 230K cycles 24.56% [kernel] [k] tg_load_down 5.76% [kernel] [k] __schedule This happens because this particular kind of load triggers 'new idle' rebalance very frequently, which requires calling update_h_load(), which, in turn, calls tg_load_down() for every idle cpu cgroup even though it is absolutely useless, because idle cpu cgroups have no tasks to pull. This patch tries to improve the situation by making h_load calculation proceed only when h_load is really necessary. To achieve this, it substitutes update_h_load() with update_cfs_rq_h_load(), which computes h_load only for a given cfs_rq and all its ascendants, and makes the load balancer call this function whenever it considers if a task should be pulled, i.e. it moves h_load calculations directly to task_h_load(). For h_load of the same cfs_rq not to be updated multiple times (in case several tasks in the same cgroup are considered during the same balance run), the patch keeps the time of the last h_load update for each cfs_rq and breaks calculation when it finds h_load to be uptodate. The benefit of it is that h_load is computed only for those cfs_rq's, which really need it, in particular all idle task groups are skipped. Although this, in fact, moves h_load calculation under rq lock, it should not affect latency much, because the amount of work done under rq lock while trying to pull tasks is limited by sched_nr_migrate. After the patch applied with the setup described above (1000 wakers in the root cgroup and 10000 idle cgroups), I get: Cpu(s): 16.9%us, 24.8%sy, 0.0%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si Events: 242K cycles 7.57% [kernel] [k] __schedule 6.70% [kernel] [k] timerqueue_add 5.93% libc-2.12.so [.] usleep Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-23 12:18:41 +02:00
Peter Zijlstra	a5cdd40c98	perf: Update perf_event_type documentation Due to a discussion with Adrian I had a good look at the perf_event_type record layout and found the documentation to be somewhat unclear. Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130716150907.GL23818@dyad.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-23 12:17:08 +02:00
Davidlohr Bueso	ec83f425db	mutex: Do not unnecessarily deal with waiters Upon entering the slowpath, we immediately attempt to acquire the lock by checking if it is already unlocked. If we are lucky enough that this is the case, then we don't need to deal with any waiter related logic. Furthermore any checks for an empty wait_list are unnecessary as we already know that count is non-negative and hence no one is waiting for the lock. Move the count check and xchg calls to be done before any waiters are setup - including waiter debugging. Upon failure to acquire the lock, the xchg sets the counter to 0, instead of -1 as it was originally. This can be done here since we set it back to -1 right at the beginning of the loop so other waiters are woken up when the lock is released. When tested on a 8-socket (80 core) system against a vanilla 3.10-rc1 kernel, this patch provides some small performance benefits (+2-6%). While it could be considered in the noise level, the average percentages were stable across multiple runs and no performance regressions were seen. Two big winners, for small amounts of users (10-100), were the short and compute workloads had a +19.36% and +%15.76% in jobs per minute. Also change some break statements to 'goto slowpath', which IMO makes a little more intuitive to read. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1372450398.2106.1.camel@buesod1.americas.hpqcorp.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-23 11:48:37 +02:00
Ingo Molnar	b59f2b4d30	Merge tag 'v3.11-rc2' into core/locking Merge in Linux 3.11-rc2 before moving on with new work. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-23 11:48:17 +02:00
Jiri Kosina	17f41571bb	kprobes/x86: Call out into INT3 handler directly instead of using notifier In `fd4363fff3` ("x86: Introduce int3 (breakpoint)-based instruction patching"), the mechanism that was introduced for notifying alternatives code from int3 exception handler that and exception occured was die_notifier. This is however problematic, as early code might be using jump labels even before the notifier registration has been performed, which will then lead to an oops due to unhandled exception. One of such occurences has been encountered by Fengguang: int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.11.0-rc1-01429-g04bf576 #8 task: ffff88000da1b040 ti: ffff88000da1c000 task.ti: ffff88000da1c000 RIP: 0010:[<ffffffff811098cc>] [<ffffffff811098cc>] ttwu_do_wakeup+0x28/0x225 RSP: 0000:ffff88000dd03f10 EFLAGS: 00000006 RAX: 0000000000000000 RBX: ffff88000dd12940 RCX: ffffffff81769c40 RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88000dd03f28 R08: ffffffff8176a8c0 R09: 0000000000000002 R10: ffffffff810ff484 R11: ffff88000dd129e8 R12: ffff88000dbc90c0 R13: ffff88000dbc90c0 R14: ffff88000da1dfd8 R15: ffff88000da1dfd8 FS: 0000000000000000(0000) GS:ffff88000dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000ffffffff CR3: 0000000001c88000 CR4: 00000000000006e0 Stack: ffff88000dd12940 ffff88000dbc90c0 ffff88000da1dfd8 ffff88000dd03f48 ffffffff81109e2b ffff88000dd12940 0000000000000000 ffff88000dd03f68 ffffffff81109e9e 0000000000000000 0000000000012940 ffff88000dd03f98 Call Trace: <IRQ> [<ffffffff81109e2b>] ttwu_do_activate.constprop.56+0x6d/0x79 [<ffffffff81109e9e>] sched_ttwu_pending+0x67/0x84 [<ffffffff8110c845>] scheduler_ipi+0x15a/0x2b0 [<ffffffff8104dfb4>] smp_reschedule_interrupt+0x38/0x41 [<ffffffff8173bf5d>] reschedule_interrupt+0x6d/0x80 <EOI> [<ffffffff810ff484>] ? __atomic_notifier_call_chain+0x5/0xc1 [<ffffffff8105cc30>] ? native_safe_halt+0xd/0x16 [<ffffffff81015f10>] default_idle+0x147/0x282 [<ffffffff81017026>] arch_cpu_idle+0x3d/0x5d [<ffffffff81127d6a>] cpu_idle_loop+0x46d/0x5db [<ffffffff81127f5c>] cpu_startup_entry+0x84/0x84 [<ffffffff8104f4f8>] start_secondary+0x3c8/0x3d5 [...] Fix this by directly calling poke_int3_handler() from the int3 exception handler (analogically to what ftrace has been doing already), instead of relying on notifier, registration of which might not have yet been finalized by the time of the first trap. Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307231007490.14024@pobox.suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-23 10:12:57 +02:00
Linus Torvalds	b3a3a9c441	Merge tag 'trace-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes and cleanups from Steven Rostedt: "This contains fixes, optimizations and some clean ups Some of the fixes need to go back to 3.10. They are minor, and deal mostly with incorrect ref counting in accessing event files. There was a couple of optimizations that should have perf perform a bit better when accessing trace events. And some various clean ups. Some of the clean ups are necessary to help in a fix to a theoretical race between opening a event file and deleting that event" * tag 'trace-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Kill the unbalanced tr->ref++ in tracing_buffers_open() tracing: Kill trace_array->waiter tracing: Do not (ab)use trace_seq in event_id_read() tracing: Simplify the iteration logic in f_start/f_next tracing: Add ref_data to function and fgraph tracer structs tracing: Miscellaneous fixes for trace_array ref counting tracing: Fix error handling to ensure instances can always be removed tracing/kprobe: Wait for disabling all running kprobe handlers tracing/perf: Move the PERF_MAX_TRACE_SIZE check into perf_trace_buf_prepare() tracing/syscall: Avoid perf_trace_buf_() if sys_data->perf_events is empty tracing/function: Avoid perf_trace_buf_() if event_function.perf_events is empty tracing: Typo fix on ring buffer comments tracing: Use trace_seq_puts()/trace_seq_putc() where possible tracing: Use correct config guard CONFIG_STACK_TRACER	2013-07-22 19:07:24 -07:00
Baruch Siach	53c0352042	sched_clock: Fix integer overflow The expression '(1 << 32)' happens to evaluate as 0 on ARM, but it evaluates as 1 on xtensa and x86_64. This zeros sched_clock_mask, and breaks sched_clock(). Set the type of 1 to 'unsigned long long' to get the value we need. Reported-by: Max Filippov <jcmvbkbc@gmail.com> Tested-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: John Stultz <john.stultz@linaro.org>	2013-07-22 16:24:22 -07:00
Prarit Bhargava	397bbf6dee	clocksource: Fix !CONFIG_CLOCKSOURCE_WATCHDOG compile If I explicitly disable the clocksource watchdog in the x86 Kconfig, the x86 kernel will not compile unless this is properly defined. Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org Signed-off-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2013-07-22 16:00:17 -07:00
Peter Zijlstra	1e40c2edef	mutex: Fix/document access-once assumption in mutex_can_spin_on_owner() mutex_can_spin_on_owner() is technically broken in that it would in theory allow the compiler to load lock->owner twice, seeing a pointer first time and a NULL pointer the second time. Linus pointed out that a compiler has to be seriously broken to not compile this correctly - but nevertheless this change is correct as it will better document the implementation. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Acked-by: Waiman Long <Waiman.Long@hp.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Link: http://lkml.kernel.org/r/20130719183101.GA20909@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-22 10:33:39 +02:00
Kirill Tkhai	87e3c8ae1c	sched/fair: Cleanup: remove duplicate variable declaration cfs_rq is declared twice, fix it. Also use 'se' instead of '&p->se'. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/169201374366727@web6d.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-22 10:27:40 +02:00
Ingo Molnar	b24d6f4912	Merge tag 'v3.11-rc2' into sched/core Merge in Linux 3.11-rc2, to provide a post-merge-window development base. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-22 10:26:10 +02:00
Oleg Nesterov	e70e78e3c8	tracing: Kill the unbalanced tr->ref++ in tracing_buffers_open() tracing_buffers_open() does trace_array_get() and then it wrongly inrcements tr->ref again under trace_types_lock. This means that every caller leaks trace_array: # cd /sys/kernel/debug/tracing/ # mkdir instances/X # true < instances/X/per_cpu/cpu0/trace_pipe_raw # rmdir instances/X rmdir: failed to remove `instances/X': Device or resource busy Link: http://lkml.kernel.org/r/20130719153644.GA18899@redhat.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: stable@vger.kernel.org # 3.10 Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-19 14:32:22 -04:00
Linus Torvalds	b7356abb9f	Merge tag 'pm+acpi-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI fixes from Rafael Wysocki: "These are fixes collected over the last week, most importnatly two cpufreq reverts fixing regressions introduced in 3.10, an autoseelp fix preventing systems using it from crashing during shutdown and two ACPI scan fixes related to hotplug. Specifics: - Two cpufreq commits from the 3.10 cycle introduced regressions. The first of them was buggy (it did way much more than it needed to do) and the second one attempted to fix an issue introduced by the first one. Fixes from Srivatsa S Bhat revert both. - If autosleep triggers during system shutdown and the shutdown callbacks of some device drivers have been called already, it may crash the system. Fix from Liu Shuo prevents that from happening by making try_to_suspend() check system_state. - The ACPI memory hotplug driver doesn't clear its driver_data on errors which may cause a NULL poiter dereference to happen later. Fix from Toshi Kani. - The ACPI namespace scanning code should not try to attach scan handlers to device objects that have them already, which may confuse things quite a bit, and it should rescan the whole namespace branch starting at the given node after receiving a bus check notify event even if the device at that particular node has been discovered already. Fixes from Rafael J Wysocki. - New ACPI video blacklist entry for a system whose initial backlight setting from the BIOS doesn't make sense. From Lan Tianyu. - Garbage string output avoindance for ACPI PNP from Liu Shuo. - Two Kconfig fixes for issues introduced recently in the s3c24xx cpufreq driver (when moving the driver to drivers/cpufreq) from Paul Bolle. - Trivial comment fix in pm_wakeup.h from Chanwoo Choi" * tag 'pm+acpi-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / video: ignore BIOS initial backlight value for Fujitsu E753 PNP / ACPI: avoid garbage in resource name cpufreq: Revert commit `2f7021a8` to fix CPU hotplug regression cpufreq: s3c24xx: fix "depends on ARM_S3C24XX" in Kconfig cpufreq: s3c24xx: rename CONFIG_CPU_FREQ_S3C24XX_DEBUGFS PM / Sleep: Fix comment typo in pm_wakeup.h PM / Sleep: avoid 'autosleep' in shutdown progress cpufreq: Revert commit a66b2e to fix suspend/resume regression ACPI / memhotplug: Fix a stale pointer in error path ACPI / scan: Always call acpi_bus_scan() for bus check notifications ACPI / scan: Do not try to attach scan handlers to devices having them	2013-07-19 09:59:06 -07:00
Oleg Nesterov	a644a7e958	tracing: Kill trace_array->waiter Trivial. trace_array->waiter has no users since `6eaaa5d5` "tracing/core: use appropriate waiting on trace_pipe". Link: http://lkml.kernel.org/r/20130719142036.GA1594@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-19 10:56:02 -04:00
Ingo Molnar	9bb15425c3	Merge branch 'x86/jumplabel' into perf/core Upcoming kprobes patches rely on the int3 code-patching machinery introduced by: `fd4363fff3` x86: Introduce int3 (breakpoint)-based instruction patching Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-19 09:55:00 +02:00
Ingo Molnar	e43fff2b98	Merge branch 'linus' into perf/core Merge in a v3.11-rc1-ish branch to go from v3.10 based development to a v3.11 based one. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-19 09:34:42 +02:00
Oleg Nesterov	cd458ba9d5	tracing: Do not (ab)use trace_seq in event_id_read() event_id_read() has no reason to kmalloc "struct trace_seq" (more than PAGE_SIZE!), it can use a small buffer instead. Note: "if (*ppos) return 0" looks strange and even wrong, simple_read_from_buffer() handles ppos != 0 case corrrectly. And it seems that almost every user of trace_seq in this file should be converted too. Unless you use seq_open(), trace_seq buys nothing compared to the raw buffer, but it needs a bit more memory and code. Link: http://lkml.kernel.org/r/20130718184712.GA4786@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:33 -04:00
Oleg Nesterov	7710b63995	tracing: Simplify the iteration logic in f_start/f_next f_next() looks overcomplicated, and it is not strictly correct even if this doesn't matter. Say, FORMAT_FIELD_SEPERATOR should not return NULL (means EOF) if trace_get_fields() returns an empty list, we should simply advance to FORMAT_PRINTFMT as we do when we find the end of list. 1. Change f_next() to return "struct list_head " rather than "ftrace_event_field ", and change f_show() to do list_entry(). This simplifies the code a bit, only f_show() needs to know about ftrace_event_field, and f_next() can play with ->prev directly 2. Change f_next() to not play with ->prev / return inside the switch() statement. It can simply set node = head/common_head, the prev-or-advance-to-the-next-magic below does all work. While at it. f_start() looks overcomplicated too. I don't think *pos == 0 makes sense as a separate case, just change this code to do "while" instead of "do/while". The patch also moves f_start() down, close to f_stop(). This is purely cosmetic, just to make the locking added by the next patch more clear/visible. Link: http://lkml.kernel.org/r/20130718184710.GA4783@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:32 -04:00
Steven Rostedt (Red Hat)	8f76899339	tracing: Add ref_data to function and fgraph tracer structs The selftest for function and function graph tracers are defined as __init, as they are only executed at boot up. The "tracer" structs that are associated to those tracers are not setup as __init as they are used after boot. To stop mismatch warnings, those structures need to be annotated with __ref_data. Currently, the tracer structures are defined to __read_mostly, as they do not really change. But in the future they should be converted to consts, but that will take a little work because they have a "next" pointer that gets updated when they are registered. That will have to wait till the next major release. Link: http://lkml.kernel.org/r/1373596735.17876.84.camel@gandalf.local.home Reported-by: kbuild test robot <fengguang.wu@intel.com> Reported-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:31 -04:00
Alexander Z Lam	f77d09a384	tracing: Miscellaneous fixes for trace_array ref counting Some error paths did not handle ref counting properly, and some trace files need ref counting. Link: http://lkml.kernel.org/r/1374171524-11948-1-git-send-email-azl@google.com Cc: stable@vger.kernel.org # 3.10 Cc: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: David Sharp <dhsharp@google.com> Cc: Alexander Z Lam <lambchop468@gmail.com> Signed-off-by: Alexander Z Lam <azl@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:30 -04:00
Alexander Z Lam	609e85a70b	tracing: Fix error handling to ensure instances can always be removed Remove debugfs directories for tracing instances during creation if an error occurs causing the trace_array for that instance to not be added to ftrace_trace_arrays. If the directory continues to exist after the error, it cannot be removed because the respective trace_array is not in ftrace_trace_arrays. Link: http://lkml.kernel.org/r/1373502874-1706-2-git-send-email-azl@google.com Cc: stable@vger.kernel.org # 3.10 Cc: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: David Sharp <dhsharp@google.com> Cc: Alexander Z Lam <lambchop468@gmail.com> Signed-off-by: Alexander Z Lam <azl@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:30 -04:00
Masami Hiramatsu	a232e270dc	tracing/kprobe: Wait for disabling all running kprobe handlers Wait for disabling all running kprobe handlers when a kprobe event is disabled, since the caller, trace_remove_event_call() supposes that a removing event is disabled completely by disabling the event. With this change, ftrace can ensure that there is no running event handlers after disabling it. Link: http://lkml.kernel.org/r/20130709093526.20138.93100.stgit@mhiramat-M0-7522 Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:29 -04:00
Oleg Nesterov	cd92bf61d6	tracing/perf: Move the PERF_MAX_TRACE_SIZE check into perf_trace_buf_prepare() Every perf_trace_buf_prepare() caller does WARN_ONCE(size > PERF_MAX_TRACE_SIZE, message) and "message" is almost the same. Shift this WARN_ONCE() into perf_trace_buf_prepare(). This changes the meaning of _ONCE, but I think this is fine. - 4947014 2932448 10104832 17984294 1126b26 vmlinux + 4948422 2932448 10104832 17985702 11270a6 vmlinux on my build. Link: http://lkml.kernel.org/r/20130617170211.GA19813@redhat.com Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:28 -04:00
Oleg Nesterov	421c7860c6	tracing/syscall: Avoid perf_trace_buf_*() if sys_data->perf_events is empty perf_trace_buf_prepare() + perf_trace_buf_submit(head, task => NULL) make no sense if hlist_empty(head). Change perf_syscall_enter/exit() to check sys_data->{enter,exit}_event->perf_events beforehand. Link: http://lkml.kernel.org/r/20130617170207.GA19806@redhat.com Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:28 -04:00
Oleg Nesterov	b8ebfd3f71	tracing/function: Avoid perf_trace_buf_*() if event_function.perf_events is empty perf_trace_buf_prepare() + perf_trace_buf_submit(head, task => NULL) make no sense if hlist_empty(head). Change perf_ftrace_function_call() to check event_function.perf_events beforehand. Link: http://lkml.kernel.org/r/20130617170204.GA19803@redhat.com Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:27 -04:00
zhangwei(Jovi)	d611851b42	tracing: Typo fix on ring buffer comments There have some mismatch between comments with real function name, update it. This patch also add some missed function arguments description. Link: http://lkml.kernel.org/r/51E3B3B2.4080307@huawei.com Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:31:26 -04:00
zhangwei(Jovi)	146c3442f2	tracing: Use trace_seq_puts()/trace_seq_putc() where possible For string without format specifiers, use trace_seq_puts() or trace_seq_putc(). Link: http://lkml.kernel.org/r/51E3B3AC.1000605@huawei.com Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> [ fixed a trace_seq_putc(s, " ") to trace_seq_putc(s, ' ') ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-18 21:30:36 -04:00
Linus Torvalds	7a62711aac	Merge tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg KH: "Here are some driver core patches for 3.11-rc2. They aren't really bugfixes, but a bunch of new helper macros for drivers to properly create attribute groups, which drivers and subsystems need to fix up a ton of race issues with incorrectly creating sysfs files (binary and normal) after userspace has been told that the device is present. Also here is the ability to create binary files as attribute groups, to solve that race condition, which was impossible to do before this, so that's my fault the drivers were broken. The majority of the .c changes is indenting and moving code around a bit. It affects no existing code, but allows the large backlog of 70+ patches that I already have created to start flowing into the different subtrees, instead of having to live in my driver-core tree, causing merge nightmares in linux-next for the next few months. These were finalized too late for the -rc1 merge window, which is why they were didn't make that pull request, testing and review from others didn't happen until a few weeks ago, and then there's the whole distraction of the past few days, which prevented these from getting to you sooner, sorry about that. Oh, and there's a bugfix for the documentation build warning in here as well. All of these have been in linux-next this week, with no reported problems" * tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: driver-core: fix new kernel-doc warning in base/platform.c sysfs: use file mode defines from stat.h sysfs: add more helper macro's for (bin_)attribute(_groups) driver core: add default groups to struct class driver core: Introduce device_create_groups sysfs: prevent warning when only using binary attributes sysfs: add support for binary attributes in groups driver core: device.h: add RW and RO attribute macros sysfs.h: add BIN_ATTR macro sysfs.h: add ATTRIBUTE_GROUPS() macro sysfs.h: add __ATTR_RW() macro	2013-07-18 12:48:40 -07:00
Marcelo Tosatti	e04c5d76b0	remove sched notifier for cross-cpu migrations Linux as a guest on KVM hypervisor, the only user of the pvclock vsyscall interface, does not require notification on task migration because: 1. cpu ID number maps 1:1 to per-CPU pvclock time info. 2. per-CPU pvclock time info is updated if the underlying CPU changes. 3. that version is increased whenever underlying CPU changes. Which is sufficient to guarantee nanoseconds counter is calculated properly. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-07-18 12:29:30 +02:00
Yacine Belkadi	e69f61862a	sched: Fix some kernel-doc warnings When building the htmldocs (in verbose mode), scripts/kernel-doc reports the follwing type of warnings: Warning(kernel/sched/core.c:936): No description found for return value of 'task_curr' ... Fix those by: - adding the missing descriptions - using "Return" sections for the descriptions Signed-off-by: Yacine Belkadi <yacine.belkadi.1@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1373654747-2389-1-git-send-email-yacine.belkadi.1@gmail.com [ While at it, fix the cpupri_set() explanation. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-07-18 09:58:21 +02:00
Jiri Kosina	fd4363fff3	x86: Introduce int3 (breakpoint)-based instruction patching Introduce a method for run-time instruction patching on a live SMP kernel based on int3 breakpoint, completely avoiding the need for stop_machine(). The way this is achieved: - add a int3 trap to the address that will be patched - sync cores - update all but the first byte of the patched range - sync cores - replace the first byte (int3) by the first byte of replacing opcode - sync cores According to http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/01530.html synchronization after replacing "all but first" instructions should not be necessary (on Intel hardware), as the syncing after the subsequent patching of the first byte provides enough safety. But there's not only Intel HW out there, and we'd rather be on a safe side. If any CPU instruction execution would collide with the patching, it'd be trapped by the int3 breakpoint and redirected to the provided "handler" (which would typically mean just skipping over the patched region, acting as "nop" has been there, in case we are doing nop -> jump and jump -> nop transitions). Ftrace has been using this very technique since `08d636b` ("ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine") for ages already, and jump labels are another obvious potential user of this. Based on activities of Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> a few years ago. Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307121102440.29788@pobox.suse.cz Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-07-16 17:55:29 -07:00
Greg Kroah-Hartman	b9b3259746	sysfs.h: add __ATTR_RW() macro A number of parts of the kernel created their own version of this, might as well have the sysfs core provide it instead. Reviewed-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-07-16 10:57:36 -07:00
Tejun Heo	a698b4488a	cgroup: remove gratuituous BUG_ON()s from rebind_subsystems() rebind_subsystems() performs santiy checks even on subsystems which aren't specified to be added or removed and the checks aren't all that useful given that these are in a very cold path while the violations they check would trip up in much hotter paths. Let's remove these from rebind_subsystems(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-07-16 05:28:24 -06:00
Tejun Heo	1d5be6b287	cgroup: move module ref handling into rebind_subsystems() Module ref handling in cgroup is rather weird. parse_cgroupfs_options() grabs all the modules for the specified subsystems. A module ref is kept if the specified subsystem is newly bound to the hierarchy. If not, or the operation fails, the refs are dropped. This scatters module ref handling across multiple functions making it difficult to track. It also make the function nasty to use for dynamic subsystem binding which is necessary for the planned unified hierarchy. There's nothing which requires the subsystem modules to be pinned between parse_cgroupfs_options() and rebind_subsystems() in both mount and remount paths. parse_cgroupfs_options() can just parse and rebind_subsystems() can handle pinning the subsystems that it wants to bind, which is a natural part of its task - binding - anyway. Move module ref handling into rebind_subsystems() which makes the code a lot simpler - modules are gotten iff it's gonna be bound and put iff unbound or binding fails. v2: Li pointed out that if a controller module is unloaded between parsing and binding, rebind_subsystems() won't notice the missing controller as it only iterates through existing controllers. Fix it by updating rebind_subsystems() to compare @added_mask to @pinned and fail with -ENOENT if they don't match. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-07-16 05:28:24 -06:00
zhangwei(Jovi)	991821c86c	tracing: Use correct config guard CONFIG_STACK_TRACER We should use CONFIG_STACK_TRACER to guard readme text of stack tracer related file, not CONFIG_STACKTRACE. Link: http://lkml.kernel.org/r/51E3B3A2.8080609@huawei.com Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2013-07-15 12:38:10 -04:00
Paul Gortmaker	0db0628d90	kernel: delete __cpuinit usage from all core kernel files The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit `5e427ec2d0` ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the uses of the __cpuinit macros from C files in the core kernel directories (kernel, init, lib, mm, and include) that don't really have a specific maintainer. [1] https://lkml.org/lkml/2013/5/20/589 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2013-07-14 19:36:59 -04:00
Paul Gortmaker	49fb4c6290	rcu: delete __cpuinit usage from all rcu files The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit `5e427ec2d0` ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the drivers/rcu uses of the __cpuinit macros from all C files. [1] https://lkml.org/lkml/2013/5/20/589 Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@freedesktop.org> Cc: Dipankar Sarma <dipankar@in.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2013-07-14 19:36:58 -04:00
Liu ShuoX	e5248a111b	PM / Sleep: avoid 'autosleep' in shutdown progress Prevent automatic system suspend from happening during system shutdown by making try_to_suspend() check system_state and return immediately if it is not SYSTEM_RUNNING. This prevents the following breakage from happening (scenario from Zhang Yanmin): Kernel starts shutdown and calls all device driver's shutdown callback. When a driver's shutdown is called, the last wakelock is released and suspend-to-ram starts. However, as some driver's shut down callbacks already shut down devices and disabled runtime pm, the suspend-to-ram calls driver's suspend callback without noticing that device is already off and causes crash. [rjw: Changelog] Signed-off-by: Liu ShuoX <shuox.liu@intel.com> Cc: 3.5+ <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2013-07-15 01:31:37 +02:00
Linus Torvalds	41d9884c44	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs stuff from Al Viro: "O_TMPFILE ABI changes, Oleg's fput() series, misc cleanups, including making simple_lookup() usable for filesystems with non-NULL s_d_op, which allows us to get rid of quite a bit of ugliness" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: sunrpc: now we can just set ->s_d_op cgroup: we can use simple_lookup() now efivarfs: we can use simple_lookup() now make simple_lookup() usable for filesystems that set ->s_d_op configfs: don't open-code d_alloc_name() __rpc_lookup_create_exclusive: pass string instead of qstr rpc_create_*_dir: don't bother with qstr llist: llist_add() can use llist_add_batch() llist: fix/simplify llist_add() and llist_add_batch() fput: turn "list_head delayed_fput_list" into llist_head fs/file_table.c:fput(): add comment Safer ABI for O_TMPFILE	2013-07-14 11:42:26 -07:00
Al Viro	786e1448d9	cgroup: we can use simple_lookup() now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-07-14 17:50:23 +04:00
Linus Torvalds	f8acc450e1	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "Fix a potential deadlock versus hrtimers" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix HRTICK	2013-07-13 15:37:57 -07:00
Linus Torvalds	505608d2b9	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: - core fix for missing round up in the generic irq chip implementation - new irq chip for MOXA SoCs - a few fixes and cleanups in the irqchip drivers * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip: Add support for MOXA ART SoCs genirq: generic chip: Use DIV_ROUND_UP to calculate numchips irqchip: nvic: Fix wrong num_ct argument for irq_alloc_domain_generic_chips() irqchip: sun4i: Staticize sun4i_irq_ack() irqchip: vt8500: Staticize local symbols	2013-07-13 15:37:30 -07:00
Linus Torvalds	0da2736686	Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: - watchdog fixes for full dynticks - improved debug output for full dynticks - remove an obsolete full dynticks check - two ARM SoC clocksource drivers for sharing across SoCs - tick broadcast fix for CPU hotplug * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick: broadcast: Check broadcast mode on CPU hotplug clocksource: arm_global_timer: Add ARM global timer support clocksource: Add Marvell Orion SoC timer nohz: Remove obsolete check for full dynticks CPUs to be RCU nocbs watchdog: Boot-disable by default on full dynticks watchdog: Rename confusing state variable watchdog: Register / unregister watchdog kthreads on sysctl control nohz: Warn if the machine can not perform nohz_full	2013-07-13 15:36:09 -07:00
Linus Torvalds	560ae37178	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: - fix for do_div() abuse on x86 - locking fix in perf core - a pile of (build) fixes and cleanups in perf tools * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) perf/x86: Fix incorrect use of do_div() in NMI warning perf: Fix perf_lock_task_context() vs RCU perf: Remove WARN_ON_ONCE() check in __perf_event_enable() for valid scenario perf: Clone child context from parent context pmu perf script: Fix broken include in Context.xs perf tools: Fix -ldw/-lelf link test when static linking perf tools: Revert regression in configuration of Python support perf tools: Fix perf version generation perf stat: Fix per-socket output bug for uncore events perf symbols: Fix vdso list searching perf evsel: Fix missing increment in sample parsing perf tools: Update symbol_conf.nr_events when processing attribute events perf tools: Fix new_term() missing free on error path perf tools: Fix parse_events_terms() segfault on error path perf evsel: Fix count parameter to read call in event_format__new perf tools: fix a typo of a Power7 event name perf tools: Fix -x/--exclude-other option for report command perf evlist: Enhance perf_evlist__start_workload() perf record: Remove -f/--force option perf record: Remove -A/--append option ...	2013-07-13 15:35:47 -07:00
Linus Torvalds	4fa109b130	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking updates from Thomas Gleixner: "Header cleanup as requested by Linus" (This is the "don't include support for ww_mutex in a header file that everybody wants, when almost nobody wants the ww part" change) * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mutex: Move ww_mutex definitions to ww_mutex.h	2013-07-13 15:35:12 -07:00
Linus Torvalds	d144746478	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus Pull MIPS updates from Ralf Baechle: "MIPS updates: - All the things that didn't make 3.10. - Removes the Windriver PPMC platform. Nobody will miss it. - Remove a workaround from kernel/irq/irqdomain.c which was there exclusivly for MIPS. Patch by Grant Likely. - More small improvments for the SEAD 3 platform - Improvments on the BMIPS / SMP support for the BCM63xx series. - Various cleanups of dead leftovers. - Platform support for the Cavium Octeon-based EdgeRouter Lite. Two large KVM patchsets didn't make it for this pull request because their respective authors are vacationing" * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (124 commits) MIPS: Kconfig: Add missing MODULES dependency to VPE_LOADER MIPS: BCM63xx: CLK: Add dummy clk_{set,round}_rate() functions MIPS: SEAD3: Disable L2 cache on SEAD-3. MIPS: BCM63xx: Enable second core SMP on BCM6328 if available MIPS: BCM63xx: Add SMP support to prom.c MIPS: define write{b,w,l,q}_relaxed MIPS: Expose missing pci_io{map,unmap} declarations MIPS: Malta: Update GCMP detection. Revert "MIPS: make CAC_ADDR and UNCAC_ADDR account for PHYS_OFFSET" MIPS: APSP: Remove <asm/kspd.h> SSB: Kconfig: Amend SSB_EMBEDDED dependencies MIPS: microMIPS: Fix improper definition of ISA exception bit. MIPS: Don't try to decode microMIPS branch instructions where they cannot exist. MIPS: Declare emulate_load_store_microMIPS as a static function. MIPS: Fix typos and cleanup comment MIPS: Cleanup indentation and whitespace MIPS: BMIPS: support booting from physical CPU other than 0 MIPS: Only set cpu_has_mmips if SYS_SUPPORTS_MICROMIPS MIPS: GIC: Fix gic_set_affinity infinite loop MIPS: Don't save/restore OCTEON wide multiplier state on syscalls. ...	2013-07-13 14:52:21 -07:00
Tejun Heo	913ffdb543	cgroup: replace task_cgroup_path_from_hierarchy() with task_cgroup_path() task_cgroup_path_from_hierarchy() was added for the planned new users and none of the currently planned users wants to know about multiple hierarchies. This patch drops the multiple hierarchy part and makes it always return the path in the first non-dummy hierarchy. As unified hierarchy will always have id 1, this is guaranteed to return the path for the unified hierarchy if mounted; otherwise, it will return the path from the hierarchy which happens to occupy the lowest hierarchy id, which will usually be the first hierarchy mounted after boot. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Lennart Poettering <lennart@poettering.net> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kaluža <jkaluza@redhat.com>	2013-07-12 12:49:05 -07:00
Tejun Heo	f172e67cf9	cgroup: move number_of_cgroups test out of rebind_subsystems() into cgroup_remount() rebind_subsystems() currently fails if the hierarchy has any !root cgroups; however, on the planned unified hierarchy, rebind_subsystems() will be used while populated. Move the test to cgroup_remount(), which is the only place the test is necessary anyway. As it's impossible for the other two callers of rebind_subsystems() to have populated hierarchy, this doesn't make any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-07-12 12:34:02 -07:00
Tejun Heo	3126121fb3	cgroup: make rebind_subsystems() handle file additions and removals with proper error handling Currently, creating and removing cgroup files in the root directory are handled separately from the actual subsystem binding and unbinding which happens in rebind_subsystems(). Also, rebind_subsystems() users aren't handling file creation errors properly. Let's integrate top_cgroup file handling into rebind_subsystems() so that it's simpler to use and everyone handles file creation errors correctly. * On a successful return, rebind_subsystems() is guaranteed to have created all files of the new subsystems and deleted the ones belonging to the removed subsystems. After a failure, no file is created or removed. * cgroup_remount() no longer needs to make explicit populate/clear calls as it's all handled by rebind_subsystems(), and it gets proper error handling automatically. * cgroup_mount() has been updated such that the root dentry and cgroup are linked before rebind_subsystems(). Also, the init_cred dancing and base file handling are moved right above rebind_subsystems() call and proper error handling for the base files is added. While at it, add a comment explaining what's going on with the cred thing. * cgroup_kill_sb() calls rebind_subsystems() to unbind all subsystems which now implies removing all subsystem files which requires the directory's i_mutex. Grab it. This means that files on the root cgroup are removed earlier - they used to be deleted from generic super_block cleanup from vfs. This doesn't lead to any functional difference and it's cleaner to do the clean up explicitly for all files. Combined with the previous changes, this makes all cgroup file creation errors handled correctly. v2: Added comment on init_cred. v3: Li spotted that cgroup_mount() wasn't freeing tmp_links after base file addition failure. Fix it by adding free_tmp_links error handling label. v4: v3 introduced build bugs which got noticed by Fengguang's awesome kbuild test robot. Fixed, and shame on me. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com>	2013-07-12 12:34:02 -07:00

... 13 14 15 16 17 ...

16923 Commits