Commit Graph

20257 Commits

Author SHA1 Message Date
Paul E. McKenney
b6a932d1d9 rcu: Make rcu_read_unlock_special() propagate ->qsmaskinit bit clearing
This commit causes rcu_read_unlock_special() to propagate ->qsmaskinit
bit clearing up the rcu_node tree once a given rcu_node structure's
blkd_tasks list becomes empty.  This is the final commit in preparation
for the rework of RCU priority boosting:  It enables preempted tasks to
remain queued on their rcu_node structure even after all of that rcu_node
structure's CPUs have gone offline.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:43 -08:00
Paul E. McKenney
8af3a5e78c rcu: Abstract rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
This commit abstracts rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
in preparation for the rework of RCU priority boosting.  This new function
will be invoked from rcu_read_unlock_special() in the reworked scheme,
which is why rcu_cleanup_dead_rnp() assumes that the leaf rcu_node
structure's ->qsmaskinit field has already been updated.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:41 -08:00
Paul E. McKenney
74e871ac6c rcu: Rename "empty" to "empty_norm" in preparation for boost rework
This commit undertakes a simple variable renaming to make way for
some rework of RCU priority boosting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:40 -08:00
Paul E. McKenney
b08ea27d95 rcu: Protect rcu_boost() lockless accesses with ACCESS_ONCE()
This commit prevents random compiler optimizations by applying
ACCESS_ONCE() to lockless accesses.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:39 -08:00
Lai Jiangshan
5a43b88e98 rcu: Remove "select IRQ_WORK" from config TREE_RCU
The 48a7639ce8 ("rcu: Make callers awaken grace-period kthread")
removed the irq_work_queue(), so the TREE_RCU doesn't need
irq work any more.  This commit therefore updates RCU's Kconfig and

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:01:17 -08:00
Paul E. McKenney
41050a0096 rcu: Fix rcu_barrier() race that could result in too-short wait
The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions.
It checks a given CPU's lists of callbacks, and if all three no-CBs lists
are empty, ignores that CPU.  However, these three lists could potentially
be empty even when callbacks are present if the check executed just as
the callbacks were being moved from one list to another.  It turns out
that recent versions of rcutorture can spot this race.

This commit plugs this hole by consolidating the per-list counts of
no-CBs callbacks into a single count, which is incremented before
the corresponding callback is posted and after it is invoked.  Then
rcu_barrier() checks this single count to reliably determine whether
the corresponding CPU has no-CBs callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:01:15 -08:00
David Hildenbrand
87af9e7ff9 hotplugcpu: Avoid deadlocks by waking active_writer
Commit b2c4623dcd ("rcu: More on deadlock between CPU hotplug and expedited
grace periods") introduced another problem that can easily be reproduced by
starting/stopping cpus in a loop.

E.g.:
  for i in `seq 5000`; do
      echo 1 > /sys/devices/system/cpu/cpu1/online
      echo 0 > /sys/devices/system/cpu/cpu1/online
  done

Will result in:
  INFO: task /cpu_start_stop:1 blocked for more than 120 seconds.
  Call Trace:
  ([<00000000006a028e>] __schedule+0x406/0x91c)
   [<0000000000130f60>] cpu_hotplug_begin+0xd0/0xd4
   [<0000000000130ff6>] _cpu_up+0x3e/0x1c4
   [<0000000000131232>] cpu_up+0xb6/0xd4
   [<00000000004a5720>] device_online+0x80/0xc0
   [<00000000004a57f0>] online_store+0x90/0xb0
  ...

And a deadlock.

Problem is that if the last ref in put_online_cpus() can't get the
cpu_hotplug.lock the puts_pending count is incremented, but a sleeping
active_writer might never be woken up, therefore never exiting the loop in
cpu_hotplug_begin().

This fix removes puts_pending and turns refcount into an atomic variable. We
also introduce a wait queue for the active_writer, to avoid possible races and
use-after-free. There is no need to take the lock in put_online_cpus() anymore.

Can't reproduce it with this fix.

Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:01:14 -08:00
Lai Jiangshan
5f6130fa52 tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task
For RCU in UP, context-switch = QS = GP, thus we can force a
context-switch when any call_rcu_[bh|sched]() is happened on idle_task.
After doing so, rcu_idle/irq_enter/exit() are useless, so we can simply
make these functions empty.

More important, this change does not change the functionality logically.
Note: raise_softirq(RCU_SOFTIRQ)/rcu_sched_qs() in rcu_idle_enter() and
outmost rcu_irq_exit() will have to wake up the ksoftirqd
(due to in_interrupt() == 0).

Before this patch		After this patch:
call_rcu_sched() in idle;	call_rcu_sched() in idle
				  set resched
do other stuffs;		do other stuffs
outmost rcu_irq_exit()		outmost rcu_irq_exit() (empty function)
  (or rcu_idle_enter())		  (or rcu_idle_enter(), also empty function)
				start to resched. (see above)
  rcu_sched_qs()		rcu_sched_qs()
    QS,and GP and advance cb	  QS,and GP and advance cb
    wake up the ksoftirqd	    wake up the ksoftirqd
      set resched
resched to ksoftirqd (or other)	resched to ksoftirqd (or other)

These two code patches are almost the same.

Size changed after patched:

size kernel/rcu/tiny-old.o kernel/rcu/tiny-patched.o
   text	   data	    bss	    dec	    hex	filename
   3449	    206	      8	   3663	    e4f	kernel/rcu/tiny-old.o
   2406	    144	      8	   2558	    9fe	kernel/rcu/tiny-patched.o

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:01:12 -08:00
Thomas Graf
113948d841 spinlock: Add spin_lock_bh_nested()
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03 14:32:57 -05:00
Linus Torvalds
5e0f872c7d Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit fix from Paul Moore:
 "One audit patch to resolve a panic/oops when recording filenames in
  the audit log, see the mail archive link below.

  The fix isn't as nice as I would like, as it involves an allocate/copy
  of the filename, but it solves the problem and the overhead should
  only affect users who have configured audit rules involving file
  names.

  We'll revisit this issue with future kernels in an attempt to make
  this suck less, but in the meantime I think this fix should go into
  the next release of v3.19-rcX.

  [ https://marc.info/?t=141986927600001&r=1&w=2 ]"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: create private file name copies when auditing inodes
2014-12-31 14:52:18 -08:00
Paul E. McKenney
924df8a011 rcu: Fix invoke_rcu_callbacks() comment
Despite what the comment says, it is only softirqs that are disabled,
not interrupts.  This commit therefore fixes the comment.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-12-30 17:40:19 -08:00
Alexander Gordeev
ca9558a33f rcu: Remove redundant rcu_is_cpu_rrupt_from_idle() from tiny RCU
Let's start assuming that something in the idle loop posts a callback,
and scheduling-clock interrupt occurs:

1. The system is idle and stays that way, no runnable tasks.

2. Scheduling-clock interrupt occurs, rcu_check_callbacks() is called
   as result, which in turn calls rcu_is_cpu_rrupt_from_idle().

3. rcu_is_cpu_rrupt_from_idle() reports the CPU was interrupted from
   idle, which results in rcu_sched_qs() call, which does a
   raise_softirq(RCU_SOFTIRQ).

4. Upon return from interrupt, rcu_irq_exit() is invoked, which calls
   rcu_idle_enter_common(), which in turn calls rcu_sched_qs() again,
   which does another raise_softirq(RCU_SOFTIRQ).

5. The softirq happens shortly and invokes rcu_process_callbacks(),
   which invokes __rcu_process_callbacks().

6. So now callbacks can be invoked. At least they can be if
   ->donetail has been updated. Which it will have been because
   rcu_sched_qs() invokes rcu_qsctr_help().

In the described scenario rcu_sched_qs() and raise_softirq(RCU_SOFTIRQ)
get called twice in steps 3 and 4. This redundancy could be eliminated
by removing rcu_is_cpu_rrupt_from_idle() function.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-12-30 17:40:18 -08:00
Paul E. McKenney
734d168013 rcu: Make rcu_nmi_enter() handle nesting
The x86 architecture has multiple types of NMI-like interrupts: real
NMIs, machine checks, and, for some values of NMI-like, debugging
and breakpoint interrupts.  These interrupts can nest inside each
other.  Andy Lutomirski is adding RCU support to these interrupts,
so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.

This commit therefore introduces nesting, using a clever NMI-coordination
algorithm suggested by Andy.  The trick is to atomically increment
->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
(and, accordingly, after on exit).  In addition, ->dynticks_nmi_nesting
is incremented by one if ->dynticks was incremented and by two otherwise.
This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
to one, it knows that ->dynticks must be atomically incremented.

This NMI-coordination algorithms has been validated by the following
Promela model:

------------------------------------------------------------------------

/*
 * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
 * that allows nesting.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, you can access it online at
 * http://www.gnu.org/licenses/gpl-2.0.html.
 *
 * Copyright IBM Corporation, 2014
 *
 * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
 */

byte dynticks_nmi_nesting = 0;
byte dynticks = 0;

/*
 * Promela verision of rcu_nmi_enter().
 */
inline rcu_nmi_enter()
{
	byte incby;
	byte tmp;

	incby = BUSY_INCBY;
	assert(dynticks_nmi_nesting >= 0);
	if
	:: (dynticks & 1) == 0 ->
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 1);
		incby = 1;
	:: else ->
		skip;
	fi;
	tmp = dynticks_nmi_nesting;
	tmp = tmp + incby;
	dynticks_nmi_nesting = tmp;
	assert(dynticks_nmi_nesting >= 1);
}

/*
 * Promela verision of rcu_nmi_exit().
 */
inline rcu_nmi_exit()
{
	byte tmp;

	assert(dynticks_nmi_nesting > 0);
	assert((dynticks & 1) != 0);
	if
	:: dynticks_nmi_nesting != 1 ->
		tmp = dynticks_nmi_nesting;
		tmp = tmp - BUSY_INCBY;
		dynticks_nmi_nesting = tmp;
	:: else ->
		dynticks_nmi_nesting = 0;
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 0);
	fi;
}

/*
 * Base-level NMI runs non-atomically.  Crudely emulates process-level
 * dynticks-idle entry/exit.
 */
proctype base_NMI()
{
	byte busy;

	busy = 0;
	do
	::	/* Emulate base-level dynticks and not. */
		if
		:: 1 ->	atomic {
				dynticks = dynticks + 1;
			}
			busy = 1;
		:: 1 ->	skip;
		fi;

		/* Verify that we only sometimes have base-level dynticks. */
		if
		:: busy == 0 -> skip;
		:: busy == 1 -> skip;
		fi;

		/* Model RCU's NMI entry and exit actions. */
		rcu_nmi_enter();
		assert((dynticks & 1) == 1);
		rcu_nmi_exit();

		/* Emulated re-entering base-level dynticks and not. */
		if
		:: !busy -> skip;
		:: busy ->
			atomic {
				dynticks = dynticks + 1;
			}
			busy = 0;
		fi;

		/* We had better now be in dyntick-idle mode. */
		assert((dynticks & 1) == 0);
	od;
}

/*
 * Nested NMI runs atomically to emulate interrupting base_level().
 */
proctype nested_NMI()
{
	do
	::	/*
		 * Use an atomic section to model a nested NMI.  This is
		 * guaranteed to interleave into base_NMI() between a pair
		 * of base_NMI() statements, just as a nested NMI would.
		 */
		atomic {
			/* Verify that we only sometimes are in dynticks. */
			if
			:: (dynticks & 1) == 0 -> skip;
			:: (dynticks & 1) == 1 -> skip;
			fi;

			/* Model RCU's NMI entry and exit actions. */
			rcu_nmi_enter();
			assert((dynticks & 1) == 1);
			rcu_nmi_exit();
		}
	od;
}

init {
	run base_NMI();
	run nested_NMI();
}

------------------------------------------------------------------------

The following script can be used to run this model if placed in
rcu_nmi.spin:

------------------------------------------------------------------------

if ! spin -a rcu_nmi.spin
then
	echo Spin errors!!!
	exit 1
fi
if ! cc -DSAFETY -o pan pan.c
then
	echo Compilation errors!!!
	exit 1
fi
./pan -m100000

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-12-30 17:40:16 -08:00
Richard Cochran
2eebdde652 timecounter: keep track of accumulated fractional nanoseconds
The current timecounter implementation will drop a variable amount
of resolution, depending on the magnitude of the time delta. In
other words, reading the clock too often or too close to a time
stamp conversion will introduce errors into the time values. This
patch fixes the issue by introducing a fractional nanosecond field
that accumulates the low order bits.

Reported-by: Janusz Użycki <j.uzycki@elproma.com.pl>
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-30 18:29:27 -05:00
Richard Cochran
74d23cc704 time: move the timecounter/cyclecounter code into its own file.
The timecounter code has almost nothing to do with the clocksource
code. Let it live in its own file. This will help isolate the
timecounter users from the clocksource users in the source tree.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-30 18:29:25 -05:00
Linus Torvalds
2c90331cf5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix double SKB free in bluetooth 6lowpan layer, from Jukka Rissanen.

 2) Fix receive checksum handling in enic driver, from Govindarajulu
    Varadarajan.

 3) Fix NAPI poll list corruption in virtio_net and caif_virtio, from
    Herbert Xu.  Also, add code to detect drivers that have this mistake
    in the future.

 4) Fix doorbell endianness handling in mlx4 driver, from Amir Vadai.

 5) Don't clobber IP6CB() before xfrm6_policy_check() is called in TCP
    input path,f rom Nicolas Dichtel.

 6) Fix MPLS action validation in openvswitch, from Pravin B Shelar.

 7) Fix double SKB free in vxlan driver, also from Pravin.

 8) When we scrub a packet, which happens when we are switching the
    context of the packet (namespace, etc.), we should reset the
    secmark.  From Thomas Graf.

 9) ->ndo_gso_check() needs to do more than return true/false, it also
    has to allow the driver to clear netdev feature bits in order for
    the caller to be able to proceed properly.  From Jesse Gross.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
  genetlink: A genl_bind() to an out-of-range multicast group should not WARN().
  netlink/genetlink: pass network namespace to bind/unbind
  ne2k-pci: Add pci_disable_device in error handling
  bonding: change error message to debug message in __bond_release_one()
  genetlink: pass multicast bind/unbind to families
  netlink: call unbind when releasing socket
  netlink: update listeners directly when removing socket
  genetlink: pass only network namespace to genl_has_listeners()
  netlink: rename netlink_unbind() to netlink_undo_bind()
  net: Generalize ndo_gso_check to ndo_features_check
  net: incorrect use of init_completion fixup
  neigh: remove next ptr from struct neigh_table
  net: xilinx: Remove unnecessary temac_property in the driver
  net: phy: micrel: use generic config_init for KSZ8021/KSZ8031
  net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
  openvswitch: fix odd_ptr_err.cocci warnings
  Bluetooth: Fix accepting connections when not using mgmt
  Bluetooth: Fix controller configuration with HCI_QUIRK_INVALID_BDADDR
  brcmfmac: Do not crash if platform data is not populated
  ipw2200: select CFG80211_WEXT
  ...
2014-12-30 10:45:47 -08:00
Paul Moore
fcf22d8267 audit: create private file name copies when auditing inodes
Unfortunately, while commit 4a928436 ("audit: correctly record file
names with different path name types") fixed a problem where we were
not recording filenames, it created a new problem by attempting to use
these file names after they had been freed.  This patch resolves the
issue by creating a copy of the filename which the audit subsystem
frees after it is done with the string.

At some point it would be nice to resolve this issue with refcounts,
or something similar, instead of having to allocate/copy strings, but
that is almost surely beyond the scope of a -rcX patch so we'll defer
that for later.  On the plus side, only audit users should be impacted
by the string copying.

Reported-by: Toralf Foerster <toralf.foerster@gmx.de>
Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-12-30 09:26:21 -05:00
Johannes Berg
023e2cfa36 netlink/genetlink: pass network namespace to bind/unbind
Netlink families can exist in multiple namespaces, and for the most
part multicast subscriptions are per network namespace. Thus it only
makes sense to have bind/unbind notifications per network namespace.

To achieve this, pass the network namespace of a given client socket
to the bind/unbind functions.

Also do this in generic netlink, and there also make sure that any
bind for multicast groups that only exist in init_net is rejected.
This isn't really a problem if it is accepted since a client in a
different namespace will never receive any notifications from such
a group, but it can confuse the family if not rejected (it's also
possible to silently (without telling the family) accept it, but it
would also have to be ignored on unbind so families that take any
kind of action on bind/unbind won't do unnecessary work for invalid
clients like that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-27 03:07:50 -05:00
Linus Torvalds
66b3f4f0a0 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Four patches to fix various problems with the audit subsystem, all are
  fairly small and straightforward.

  One patch fixes a problem where we weren't using the correct gfp
  allocation flags (GFP_KERNEL regardless of context, oops), one patch
  fixes a problem with old userspace tools (this was broken for a
  while), one patch fixes a problem where we weren't recording pathnames
  correctly, and one fixes a problem with PID based filters.

  In general I don't think there is anything controversial with this
  patchset, and it fixes some rather unfortunate bugs; the allocation
  flag one can be particularly scary looking for users"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: restore AUDIT_LOGINUID unset ABI
  audit: correctly record file names with different path name types
  audit: use supplied gfp_mask from audit_buffer in kauditd_send_multicast_skb
  audit: don't attempt to lookup PIDs when changing PID filtering audit rules
2014-12-23 18:13:16 -08:00
Richard Guy Briggs
041d7b98ff audit: restore AUDIT_LOGINUID unset ABI
A regression was caused by commit 780a7654ce:
	 audit: Make testing for a valid loginuid explicit.
(which in turn attempted to fix a regression caused by e1760bd)

When audit_krule_to_data() fills in the rules to get a listing, there was a
missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.

This broke userspace by not returning the same information that was sent and
expected.

The rule:
	auditctl -a exit,never -F auid=-1
gives:
	auditctl -l
		LIST_RULES: exit,never f24=0 syscall=all
when it should give:
		LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all

Tag it so that it is reported the same way it was set.  Create a new
private flags audit_krule field (pflags) to store it that won't interact with
the public one from the API.

Cc: stable@vger.kernel.org # v3.10-rc1+
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-12-23 16:40:18 -05:00
Alex Thorlton
b74e6278fd sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation
When allocating space for load_balance_mask, in sched_init, when
CPUMASK_OFFSTACK is set, we've managed to spill over
KMALLOC_MAX_SIZE on our 6144 core machine.  The patch below
breaks up the allocations so that they don't overflow the max
alloc size.  It also allocates the masks on the the node from
which they'll most commonly be accessed, to minimize remote
accesses on NUMA machines.

Suggested-by: George Beshers <gbeshers@sgi.com>
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: George Beshers <gbeshers@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-23 11:43:48 +01:00
Steven Rostedt (Red Hat)
d716ff71dd tracing: Remove taking of trace_types_lock in pipe files
Taking the global mutex "trace_types_lock" in the trace_pipe files
causes a bottle neck as most the pipe files can be read per cpu
and there's no reason to serialize them.

The current_trace variable was given a ref count and it can not
change when the ref count is not zero. Opening the trace_pipe
files will up the ref count (and decremented on close), so that
the lock no longer needs to be taken when accessing the
current_trace variable.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-22 23:37:46 -05:00
Rusty Russell
574732c73d param: initialize store function to NULL if not available.
I rebased Kees' 'param: do not set store func without write perm'
on top of my 'params: cleanup sysfs allocation'.  However, my patch
uses krealloc which doesn't zero memory, leaving .store unset.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-12-23 15:07:41 +10:30
Steven Rostedt (Red Hat)
cf6ab6d914 tracing: Add ref count to tracer for when they are being read by pipe
When one of the trace pipe files are being read (by either the trace_pipe
or trace_pipe_raw), do not allow the current_trace to change. By adding
a ref count that is incremented when the pipe files are opened, will
prevent the current_trace from being changed.

This will allow for the removal of the global trace_types_lock from
reading the pipe buffers (which is currently a bottle neck).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-22 15:39:40 -05:00
Josh Poimboeuf
33e8612f64 livepatch: use FTRACE_OPS_FL_IPMODIFY
Use the FTRACE_OPS_FL_IPMODIFY flag to prevent conflicts with other
ftrace users who also modify regs->ip.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22 20:05:59 +01:00
Paul Moore
4a92843601 audit: correctly record file names with different path name types
There is a problem with the audit system when multiple audit records
are created for the same path, each with a different path name type.
The root cause of the problem is in __audit_inode() when an exact
match (both the path name and path name type) is not found for a
path name record; the existing code creates a new path name record,
but it never sets the path name in this record, leaving it NULL.
This patch corrects this problem by assigning the path name to these
newly created records.

There are many ways to reproduce this problem, but one of the
easiest is the following (assuming auditd is running):

  # mkdir /root/tmp/test
  # touch /root/tmp/test/567
  # auditctl -a always,exit -F dir=/root/tmp/test
  # touch /root/tmp/test/567

Afterwards, or while the commands above are running, check the audit
log and pay special attention to the PATH records.  A faulty kernel
will display something like the following for the file creation:

  type=SYSCALL msg=audit(1416957442.025:93): arch=c000003e syscall=2
    success=yes exit=3 ... comm="touch" exe="/usr/bin/touch"
  type=CWD msg=audit(1416957442.025:93):  cwd="/root/tmp"
  type=PATH msg=audit(1416957442.025:93): item=0 name="test/"
    inode=401409 ... nametype=PARENT
  type=PATH msg=audit(1416957442.025:93): item=1 name=(null)
    inode=393804 ... nametype=NORMAL
  type=PATH msg=audit(1416957442.025:93): item=2 name=(null)
    inode=393804 ... nametype=NORMAL

While a patched kernel will show the following:

  type=SYSCALL msg=audit(1416955786.566:89): arch=c000003e syscall=2
    success=yes exit=3 ... comm="touch" exe="/usr/bin/touch"
  type=CWD msg=audit(1416955786.566:89):  cwd="/root/tmp"
  type=PATH msg=audit(1416955786.566:89): item=0 name="test/"
    inode=401409 ... nametype=PARENT
  type=PATH msg=audit(1416955786.566:89): item=1 name="test/567"
    inode=393804 ... nametype=NORMAL

This issue was brought up by a number of people, but special credit
should go to hujianyang@huawei.com for reporting the problem along
with an explanation of the problem and a patch.  While the original
patch did have some problems (see the archive link below), it did
demonstrate the problem and helped kickstart the fix presented here.

  * https://lkml.org/lkml/2014/9/5/66

Reported-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
2014-12-22 12:27:39 -05:00
Li Bin
b5bfc51707 livepatch: move x86 specific ftrace handler code to arch/x86
The execution flow redirection related implemention in the livepatch
ftrace handler is depended on the specific architecture. This patch
introduces klp_arch_set_pc(like kgdb_arch_set_pc) interface to change
the pt_regs.

Signed-off-by: Li Bin <huawei.libin@huawei.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22 15:40:49 +01:00
Seth Jennings
b700e7f03d livepatch: kernel: add support for live patching
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that can not execute safely with
the old version of the function can _not_ be safely applied in this
version.

[ jkosina@suse.cz: due to the number of contributions that got folded into
  this original patch from Seth Jennings, add SUSE's copyright as well, as
  discussed via e-mail ]

Signed-off-by: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22 15:40:49 +01:00
Seth Jennings
c5f4546593 livepatch: kernel: add TAINT_LIVEPATCH
This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched.  This will provide a clean indication in
bug reports that live patching was used.

Additionally, if the crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.

Signed-off-by: Seth Jennings <sjenning@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22 15:40:48 +01:00
Linus Torvalds
5d6a546886 Merge tag 'pm-config-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull CONFIG_PM_RUNTIME elimination from Rafael Wysocki:
 "This removes the last few uses of CONFIG_PM_RUNTIME introduced
  recently and makes that config option finally go away.

  CONFIG_PM will be available directly from the menu now and also it
  will be selected automatically if CONFIG_SUSPEND or CONFIG_HIBERNATION
  is set"

* tag 'pm-config-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: Eliminate CONFIG_PM_RUNTIME
  tty: 8250_omap: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  sound: sst-haswell-pcm: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  spi: Replace CONFIG_PM_RUNTIME with CONFIG_PM
2014-12-20 13:37:44 -08:00
Richard Guy Briggs
54dc77d974 audit: use supplied gfp_mask from audit_buffer in kauditd_send_multicast_skb
Eric Paris explains: Since kauditd_send_multicast_skb() gets called in
audit_log_end(), which can come from any context (aka even a sleeping context)
GFP_KERNEL can't be used.  Since the audit_buffer knows what context it should
use, pass that down and use that.

See: https://lkml.org/lkml/2014/12/16/542

BUG: sleeping function called from invalid context at mm/slab.c:2849
in_atomic(): 1, irqs_disabled(): 0, pid: 885, name: sulogin
2 locks held by sulogin/885:
  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff91152e30>] prepare_bprm_creds+0x28/0x8b
  #1:  (tty_files_lock){+.+.+.}, at: [<ffffffff9123e787>] selinux_bprm_committing_creds+0x55/0x22b
CPU: 1 PID: 885 Comm: sulogin Not tainted 3.18.0-next-20141216 #30
Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A15 06/20/2014
  ffff880223744f10 ffff88022410f9b8 ffffffff916ba529 0000000000000375
  ffff880223744f10 ffff88022410f9e8 ffffffff91063185 0000000000000006
  0000000000000000 0000000000000000 0000000000000000 ffff88022410fa38
Call Trace:
  [<ffffffff916ba529>] dump_stack+0x50/0xa8
  [<ffffffff91063185>] ___might_sleep+0x1b6/0x1be
  [<ffffffff910632a6>] __might_sleep+0x119/0x128
  [<ffffffff91140720>] cache_alloc_debugcheck_before.isra.45+0x1d/0x1f
  [<ffffffff91141d81>] kmem_cache_alloc+0x43/0x1c9
  [<ffffffff914e148d>] __alloc_skb+0x42/0x1a3
  [<ffffffff914e2b62>] skb_copy+0x3e/0xa3
  [<ffffffff910c263e>] audit_log_end+0x83/0x100
  [<ffffffff9123b8d3>] ? avc_audit_pre_callback+0x103/0x103
  [<ffffffff91252a73>] common_lsm_audit+0x441/0x450
  [<ffffffff9123c163>] slow_avc_audit+0x63/0x67
  [<ffffffff9123c42c>] avc_has_perm+0xca/0xe3
  [<ffffffff9123dc2d>] inode_has_perm+0x5a/0x65
  [<ffffffff9123e7ca>] selinux_bprm_committing_creds+0x98/0x22b
  [<ffffffff91239e64>] security_bprm_committing_creds+0xe/0x10
  [<ffffffff911515e6>] install_exec_creds+0xe/0x79
  [<ffffffff911974cf>] load_elf_binary+0xe36/0x10d7
  [<ffffffff9115198e>] search_binary_handler+0x81/0x18c
  [<ffffffff91153376>] do_execveat_common.isra.31+0x4e3/0x7b7
  [<ffffffff91153669>] do_execve+0x1f/0x21
  [<ffffffff91153967>] SyS_execve+0x25/0x29
  [<ffffffff916c61a9>] stub_execve+0x69/0xa0

Cc: stable@vger.kernel.org #v3.16-rc1
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-12-19 18:37:56 -05:00
Paul Moore
3640dcfa4f audit: don't attempt to lookup PIDs when changing PID filtering audit rules
Commit f1dc4867 ("audit: anchor all pid references in the initial pid
namespace") introduced a find_vpid() call when adding/removing audit
rules with PID/PPID filters; unfortunately this is problematic as
find_vpid() only works if there is a task with the associated PID
alive on the system.  The following commands demonstrate a simple
reproducer.

	# auditctl -D
	# auditctl -l
	# autrace /bin/true
	# auditctl -l

This patch resolves the problem by simply using the PID provided by
the user without any additional validation, e.g. no calls to check to
see if the task/PID exists.

Cc: stable@vger.kernel.org # 3.15
Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
2014-12-19 18:35:53 -05:00
Rafael J. Wysocki
464ed18ebd PM: Eliminate CONFIG_PM_RUNTIME
Having switched over all of the users of CONFIG_PM_RUNTIME to use
CONFIG_PM directly, turn the latter into a user-selectable option
and drop the former entirely from the tree.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Kevin Hilman <khilman@linaro.org>
2014-12-19 22:55:06 +01:00
Linus Torvalds
4bb9374e0b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ update from Thomas Gleixner:
 "Remove the call into the nohz idle code from the fake 'idle' thread in
  the powerclamp driver along with the export of those functions which
  was smuggeled in via the thermal tree.  People have tried to hack
  around it in the nohz core code, but it just violates all rightful
  assumptions of that code about the only valid calling context (i.e.
  the proper idle task).

  The powerclamp trainwreck will still work, it just wont get the
  benefit of long idle sleeps"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/powerclamp: Remove tick_nohz_idle abuse
2014-12-19 13:29:20 -08:00
Linus Torvalds
ac88ee3b6c Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core fix from Thomas Gleixner:
 "A single fix plugging a long standing race between proc/stat and
  proc/interrupts access and freeing of interrupt descriptors"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Prevent proc race against freeing of irq descriptors
2014-12-19 13:26:08 -08:00
Linus Torvalds
88a57667f2 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes and cleanups from Ingo Molnar:
 "A kernel fix plus mostly tooling fixes, but also some tooling
  restructuring and cleanups"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  perf: Fix building warning on ARM 32
  perf symbols: Fix use after free in filename__read_build_id
  perf evlist: Use roundup_pow_of_two
  tools: Adopt roundup_pow_of_two
  perf tools: Make the mmap length autotuning more robust
  tools: Adopt rounddown_pow_of_two and deps
  tools: Adopt fls_long and deps
  tools: Move bitops.h from tools/perf/util to tools/
  tools: Introduce asm-generic/bitops.h
  tools lib: Move asm-generic/bitops/find.h code to tools/include and tools/lib
  tools: Whitespace prep patches for moving bitops.h
  tools: Move code originally from asm-generic/atomic.h into tools/include/asm-generic/
  tools: Move code originally from linux/log2.h to tools/include/linux/
  tools: Move __ffs implementation to tools/include/asm-generic/bitops/__ffs.h
  perf evlist: Do not use hard coded value for a mmap_pages default
  perf trace: Let the perf_evlist__mmap autosize the number of pages to use
  perf evlist: Improve the strerror_mmap method
  perf evlist: Clarify sterror_mmap variable names
  perf evlist: Fixup brown paper bag on "hint" for --mmap-pages cmdline arg
  perf trace: Provide a better explanation when mmap fails
  ...
2014-12-19 13:15:24 -08:00
Thomas Gleixner
a5fd9733a3 tick/powerclamp: Remove tick_nohz_idle abuse
commit 4dbd27711c "tick: export nohz tick idle symbols for module
use" was merged via the thermal tree without an explicit ack from the
relevant maintainers.

The exports are abused by the intel powerclamp driver which implements
a fake idle state from a sched FIFO task. This causes all kinds of
wreckage in the NOHZ core code which rightfully assumes that
tick_nohz_idle_enter/exit() are only called from the idle task itself.

Recent changes in the NOHZ core lead to a failure of the powerclamp
driver and now people try to hack completely broken and backwards
workarounds into the NOHZ core code. This is completely unacceptable
and just papers over the real problem. There are way more subtle
issues lurking around the corner.

The real solution is to fix the powerclamp driver by rewriting it with
a sane concept, but that's beyond the scope of this.

So the only solution for now is to remove the calls into the core NOHZ
code from the powerclamp trainwreck along with the exports. 

Fixes: d6d71ee4a1 "PM: Introduce Intel PowerClamp Driver"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Pan Jacob jun <jacob.jun.pan@intel.com>
Cc: LKP <lkp@01.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-12-19 14:05:52 +01:00
Linus Torvalds
d790be3863 Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
 "The exciting thing here is the getting rid of stop_machine on module
  removal.  This is possible by using a simple atomic_t for the counter,
  rather than our fancy per-cpu counter: it turns out that no one is
  doing a module increment per net packet, so the slowdown should be in
  the noise"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  param: do not set store func without write perm
  params: cleanup sysfs allocation
  kernel:module Fix coding style errors and warnings.
  module: Remove stop_machine from module unloading
  module: Replace module_ref with atomic_t refcnt
  lib/bug: Use RCU list ops for module_bug_list
  module: Unlink module with RCU synchronizing instead of stop_machine
  module: Wait for RCU synchronizing before releasing a module
2014-12-18 20:55:41 -08:00
Linus Torvalds
c0f486fde3 Merge tag 'pm+acpi-3.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
 "These are regression fixes (leds-gpio, ACPI backlight driver,
  operating performance points library, ACPI device enumeration
  messages, cpupower tool), other bug fixes (ACPI EC driver, ACPI device
  PM), some cleanups in the operating performance points (OPP)
  framework, continuation of CONFIG_PM_RUNTIME elimination, a couple of
  minor intel_pstate driver changes, a new MAINTAINERS entry for it and
  an ACPI fan driver change needed for better support of thermal
  management in user space.

  Specifics:

   - Fix a regression in leds-gpio introduced by a recent commit that
     inadvertently changed the name of one of the properties used by the
     driver (Fabio Estevam).

   - Fix a regression in the ACPI backlight driver introduced by a
     recent fix that missed one special case that had to be taken into
     account (Aaron Lu).

   - Drop the level of some new kernel messages from the ACPI core
     introduced by a recent commit to KERN_DEBUG which they should have
     used from the start and drop some other unuseful KERN_ERR messages
     printed by ACPI (Rafael J Wysocki).

   - Revert an incorrect commit modifying the cpupower tool (Prarit
     Bhargava).

   - Fix two regressions introduced by recent commits in the OPP library
     and clean up some existing minor issues in that code (Viresh
     Kumar).

   - Continue to replace CONFIG_PM_RUNTIME with CONFIG_PM throughout the
     tree (or drop it where that can be done) in order to make it
     possible to eliminate CONFIG_PM_RUNTIME (Rafael J Wysocki, Ulf
     Hansson, Ludovic Desroches).

     There will be one more "CONFIG_PM_RUNTIME removal" batch after this
     one, because some new uses of it have been introduced during the
     current merge window, but that should be sufficient to finally get
     rid of it.

   - Make the ACPI EC driver more robust against race conditions related
     to GPE handler installation failures (Lv Zheng).

   - Prevent the ACPI device PM core code from attempting to disable
     GPEs that it has not enabled which confuses ACPICA and makes it
     report errors unnecessarily (Rafael J Wysocki).

   - Add a "force" command line switch to the intel_pstate driver to
     make it possible to override the blacklisting of some systems in
     that driver if needed (Ethan Zhao).

   - Improve intel_pstate code documentation and add a MAINTAINERS entry
     for it (Kristen Carlson Accardi).

   - Make the ACPI fan driver create cooling device interfaces witn
     names that reflect the IDs of the ACPI device objects they are
     associated with, except for "generic" ACPI fans (PNP ID "PNP0C0B").

     That's necessary for user space thermal management tools to be able
     to connect the fans with the parts of the system they are supposed
     to be cooling properly.  From Srinivas Pandruvada"

* tag 'pm+acpi-3.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
  MAINTAINERS: add entry for intel_pstate
  ACPI / video: update the skip case for acpi_video_device_in_dod()
  power / PM: Eliminate CONFIG_PM_RUNTIME
  NFC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  SCSI / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  ACPI / EC: Fix unexpected ec_remove_handlers() invocations
  Revert "tools: cpupower: fix return checks for sysfs_get_idlestate_count()"
  tracing / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  x86 / PM: Replace CONFIG_PM_RUNTIME in io_apic.c
  PM: Remove the SET_PM_RUNTIME_PM_OPS() macro
  mmc: atmel-mci: use SET_RUNTIME_PM_OPS() macro
  PM / Kconfig: Replace PM_RUNTIME with PM in dependencies
  ARM / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  sound / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  phy / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  video / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  tty / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  spi: Replace CONFIG_PM_RUNTIME with CONFIG_PM
  ACPI / PM: Do not disable wakeup GPEs that have not been enabled
  ACPI / utils: Drop error messages from acpi_evaluate_reference()
  ...
2014-12-18 20:28:33 -08:00
Kees Cook
b0a65b0ccc param: do not set store func without write perm
When a module_param is defined without DAC write permissions, it can
still be changed at runtime and updated. Drivers using a 0444 permission
may be surprised that these values can still be changed.

For drivers that want to allow updates, any S_IW* flag will set the
"store" function as before. Drivers without S_IW* flags will have the
"store" function unset, unforcing a read-only value. Drivers that wish
neither "store" nor "get" can continue to use "0" for perms to stay out
of sysfs entirely.

Old behavior:
  # cd /sys/module/snd/parameters
  # ls -l
  total 0
  -r--r--r-- 1 root root 4096 Dec 11 13:55 cards_limit
  -r--r--r-- 1 root root 4096 Dec 11 13:55 major
  -r--r--r-- 1 root root 4096 Dec 11 13:55 slots
  # cat major
  116
  # echo -1 > major
  -bash: major: Permission denied
  # chmod u+w major
  # echo -1 > major
  # cat major
  -1

New behavior:
  ...
  # chmod u+w major
  # echo -1 > major
  -bash: echo: write error: Input/output error

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2014-12-18 12:38:51 +10:30
Linus Torvalds
87c31b39ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace related fixes from Eric Biederman:
 "As these are bug fixes almost all of thes changes are marked for
  backporting to stable.

  The first change (implicitly adding MNT_NODEV on remount) addresses a
  regression that was created when security issues with unprivileged
  remount were closed.  I go on to update the remount test to make it
  easy to detect if this issue reoccurs.

  Then there are a handful of mount and umount related fixes.

  Then half of the changes deal with the a recently discovered design
  bug in the permission checks of gid_map.  Unix since the beginning has
  allowed setting group permissions on files to less than the user and
  other permissions (aka ---rwx---rwx).  As the unix permission checks
  stop as soon as a group matches, and setgroups allows setting groups
  that can not later be dropped, results in a situtation where it is
  possible to legitimately use a group to assign fewer privileges to a
  process.  Which means dropping a group can increase a processes
  privileges.

  The fix I have adopted is that gid_map is now no longer writable
  without privilege unless the new file /proc/self/setgroups has been
  set to permanently disable setgroups.

  The bulk of user namespace using applications even the applications
  using applications using user namespaces without privilege remain
  unaffected by this change.  Unfortunately this ix breaks a couple user
  space applications, that were relying on the problematic behavior (one
  of which was tools/selftests/mount/unprivileged-remount-test.c).

  To hopefully prevent needing a regression fix on top of my security
  fix I rounded folks who work with the container implementations mostly
  like to be affected and encouraged them to test the changes.

    > So far nothing broke on my libvirt-lxc test bed. :-)
    > Tested with openSUSE 13.2 and libvirt 1.2.9.
    > Tested-by: Richard Weinberger <richard@nod.at>

    > Tested on Fedora20 with libvirt 1.2.11, works fine.
    > Tested-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>

    > Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
    > Just to be sure I was testing the right thing I also tested using
    > my unprivileged nsexec testcases, and they failed on setgroup/setgid
    > as now expected, and succeeded there without your patches.
    > Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>

    > I tested this with Sandstorm.  It breaks as is and it works if I add
    > the setgroups thing.
    > Tested-by: Andy Lutomirski <luto@amacapital.net> # breaks things as designed :("

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  userns: Unbreak the unprivileged remount tests
  userns; Correct the comment in map_write
  userns: Allow setting gid_maps without privilege when setgroups is disabled
  userns: Add a knob to disable setgroups on a per user namespace basis
  userns: Rename id_map_mutex to userns_state_mutex
  userns: Only allow the creator of the userns unprivileged mappings
  userns: Check euid no fsuid when establishing an unprivileged uid mapping
  userns: Don't allow unprivileged creation of gid mappings
  userns: Don't allow setgroups until a gid mapping has been setablished
  userns: Document what the invariant required for safe unprivileged mappings.
  groups: Consolidate the setgroups permission checks
  mnt: Clear mnt_expire during pivot_root
  mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
  mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
  umount: Do not allow unmounting rootfs.
  umount: Disallow unprivileged mount force
  mnt: Update unprivileged remount test
  mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
2014-12-17 12:31:40 -08:00
Linus Torvalds
603ba7e41b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile #2 from Al Viro:
 "Next pile (and there'll be one or two more).

  The large piece in this one is getting rid of /proc/*/ns/* weirdness;
  among other things, it allows to (finally) make nameidata completely
  opaque outside of fs/namei.c, making for easier further cleanups in
  there"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  coda_venus_readdir(): use file_inode()
  fs/namei.c: fold link_path_walk() call into path_init()
  path_init(): don't bother with LOOKUP_PARENT in argument
  fs/namei.c: new helper (path_cleanup())
  path_init(): store the "base" pointer to file in nameidata itself
  make default ->i_fop have ->open() fail with ENXIO
  make nameidata completely opaque outside of fs/namei.c
  kill proc_ns completely
  take the targets of /proc/*/ns/* symlinks to separate fs
  bury struct proc_ns in fs/proc
  copy address of proc_ns_ops into ns_common
  new helpers: ns_alloc_inum/ns_free_inum
  make proc_ns_operations work with struct ns_common * instead of void *
  switch the rest of proc_ns_operations to working with &...->ns
  netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
  make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
  common object embedded into various struct ....ns
2014-12-16 15:53:03 -08:00
Linus Torvalds
a7c180aa7e Merge tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
 "As the merge window is still open, and this code was not as complex as
  I thought it might be.  I'm pushing this in now.

  This will allow Thomas to debug his irq work for 3.20.

  This adds two new features:

  1) Allow traceopoints to be enabled right after mm_init().

     By passing in the trace_event= kernel command line parameter,
     tracepoints can be enabled at boot up.  For debugging things like
     the initialization of interrupts, it is needed to have tracepoints
     enabled very early.  People have asked about this before and this
     has been on my todo list.  As it can be helpful for Thomas to debug
     his upcoming 3.20 IRQ work, I'm pushing this now.  This way he can
     add tracepoints into the IRQ set up and have users enable them when
     things go wrong.

  2) Have the tracepoints printed via printk() (the console) when they
     are triggered.

     If the irq code locks up or reboots the box, having the tracepoint
     output go into the kernel ring buffer is useless for debugging.
     But being able to add the tp_printk kernel command line option
     along with the trace_event= option will have these tracepoints
     printed as they occur, and that can be really useful for debugging
     early lock up or reboot problems.

  This code is not that intrusive and it passed all my tests.  Thomas
  tried them out too and it works for his needs.

   Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org"

* tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Add tp_printk cmdline to have tracepoints go to printk()
  tracing: Move enabling tracepoints to just after rcu_init()
2014-12-16 12:53:59 -08:00
Linus Torvalds
988adfdffd Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
 "Highlights:

   - AMD KFD driver merge

     This is the AMD HSA interface for exposing a lowlevel interface for
     GPGPU use.  They have an open source userspace built on top of this
     interface, and the code looks as good as it was going to get out of
     tree.

   - Initial atomic modesetting work

     The need for an atomic modesetting interface to allow userspace to
     try and send a complete set of modesetting state to the driver has
     arisen, and been suffering from neglect this past year.  No more,
     the start of the common code and changes for msm driver to use it
     are in this tree.  Ongoing work to get the userspace ioctl finished
     and the code clean will probably wait until next kernel.

   - DisplayID 1.3 and tiled monitor exposed to userspace.

     Tiled monitor property is now exposed for userspace to make use of.

   - Rockchip drm driver merged.

   - imx gpu driver moved out of staging

  Other stuff:

   - core:
        panel - MIPI DSI + new panels.
        expose suggested x/y properties for virtual GPUs

   - i915:
        Initial Skylake (SKL) support
        gen3/4 reset work
        start of dri1/ums removal
        infoframe tracking
        fixes for lots of things.

   - nouveau:
        tegra k1 voltage support
        GM204 modesetting support
        GT21x memory reclocking work

   - radeon:
        CI dpm fixes
        GPUVM improvements
        Initial DPM fan control

   - rcar-du:
        HDMI support added
        removed some support for old boards
        slave encoder driver for Analog Devices adv7511

   - exynos:
        Exynos4415 SoC support

   - msm:
        a4xx gpu support
        atomic helper conversion

   - tegra:
        iommu support
        universal plane support
        ganged-mode DSI support

   - sti:
        HDMI i2c improvements

   - vmwgfx:
        some late fixes.

   - qxl:
        use suggested x/y properties"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
  drm: sti: fix module compilation issue
  drm/i915: save/restore GMBUS freq across suspend/resume on gen4
  drm: sti: correctly cleanup CRTC and planes
  drm: sti: add HQVDP plane
  drm: sti: add cursor plane
  drm: sti: enable auxiliary CRTC
  drm: sti: fix delay in VTG programming
  drm: sti: prepare sti_tvout to support auxiliary crtc
  drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
  drm: sti: fix hdmi avi infoframe
  drm: sti: remove event lock while disabling vblank
  drm: sti: simplify gdp code
  drm: sti: clear all mixer control
  drm: sti: remove gpio for HDMI hot plug detection
  drm: sti: allow to change hdmi ddc i2c adapter
  drm/doc: Document drm_add_modes_noedid() usage
  drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
  drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
  drm: Zero out DRM object memory upon cleanup
  drm/i915/bdw: Fix the write setting up the WIZ hashing mode
  ...
2014-12-15 15:52:01 -08:00
Steven Rostedt (Red Hat)
0daa230296 tracing: Add tp_printk cmdline to have tracepoints go to printk()
Add the kernel command line tp_printk option that will have tracepoints
that are active sent to printk() as well as to the trace buffer.

Passing "tp_printk" will activate this. To turn it off, the sysctl
/proc/sys/kernel/tracepoint_printk can have '0' echoed into it. Note,
this only works if the cmdline option is used. Echoing 1 into the sysctl
file without the cmdline option will have no affect.

Note, this is a dangerous option. Having high frequency tracepoints send
their data to printk() can possibly cause a live lock. This is another
reason why this is only active if the command line option is used.

Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-15 10:17:38 -05:00
Steven Rostedt (Red Hat)
5f893b2639 tracing: Move enabling tracepoints to just after rcu_init()
Enabling tracepoints at boot up can be very useful. The tracepoint
can be initialized right after RCU has been. There's no need to
wait for the early_initcall() to be called. That's too late for some
things that can use tracepoints for debugging. Move the logic to
enable tracepoints out of the initcalls and into init/main.c to
right after rcu_init().

This also allows trace_printk() to be used early too.

Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Link: http://lkml.kernel.org/r/20141214164104.307127356@goodmis.org

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-15 10:16:50 -05:00
Linus Torvalds
37da7bbbe8 Merge tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
 "Here's the big tty/serial driver update for 3.19-rc1.

  There are a number of TTY core changes/fixes in here from Peter Hurley
  that have all been teted in linux-next for a long time now.  There are
  also the normal serial driver updates as well, full details in the
  changelog below"

* tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (219 commits)
  serial: pxa: hold port.lock when reporting modem line changes
  tty-hvsi_lib: Deletion of an unnecessary check before the function call "tty_kref_put"
  tty: Deletion of unnecessary checks before two function calls
  n_tty: Fix read_buf race condition, increment read_head after pushing data
  serial: of-serial: add PM suspend/resume support
  Revert "serial: of-serial: add PM suspend/resume support"
  Revert "serial: of-serial: fix up PM ops on no_console_suspend and port type"
  serial: 8250: don't attempt a trylock if in sysrq
  serial: core: Add big-endian iotype
  serial: samsung: use port->fifosize instead of hardcoded values
  serial: samsung: prefer to use fifosize from driver data
  serial: samsung: fix style problems
  serial: samsung: wait for transfer completion before clock disable
  serial: icom: fix error return code
  serial: tegra: clean up tty-flag assignments
  serial: Fix io address assign flow with Fintek PCI-to-UART Product
  serial: mxs-auart: fix tx_empty against shift register
  serial: mxs-auart: fix gpio change detection on interrupt
  serial: mxs-auart: Fix mxs_auart_set_ldisc()
  serial: 8250_dw: Use 64-bit access for OCTEON.
  ...
2014-12-14 15:23:32 -08:00
Linus Torvalds
caf292ae5b Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block
Pull block driver core update from Jens Axboe:
 "This is the pull request for the core block IO changes for 3.19.  Not
  a huge round this time, mostly lots of little good fixes:

   - Fix a bug in sysfs blktrace interface causing a NULL pointer
     dereference, when enabled/disabled through that API.  From Arianna
     Avanzini.

   - Various updates/fixes/improvements for blk-mq:

        - A set of updates from Bart, mostly fixing buts in the tag
          handling.

        - Cleanup/code consolidation from Christoph.

        - Extend queue_rq API to be able to handle batching issues of IO
          requests. NVMe will utilize this shortly. From me.

        - A few tag and request handling updates from me.

        - Cleanup of the preempt handling for running queues from Paolo.

        - Prevent running of unmapped hardware queues from Ming Lei.

        - Move the kdump memory limiting check to be in the correct
          location, from Shaohua.

        - Initialize all software queues at init time from Takashi. This
          prevents a kobject warning when CPUs are brought online that
          weren't online when a queue was registered.

   - Single writeback fix for I_DIRTY clearing from Tejun.  Queued with
     the core IO changes, since it's just a single fix.

   - Version X of the __bio_add_page() segment addition retry from
     Maurizio.  Hope the Xth time is the charm.

   - Documentation fixup for IO scheduler merging from Jan.

   - Introduce (and use) generic IO stat accounting helpers for non-rq
     drivers, from Gu Zheng.

   - Kill off artificial limiting of max sectors in a request from
     Christoph"

* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
  bio: modify __bio_add_page() to accept pages that don't start a new segment
  blk-mq: Fix uninitialized kobject at CPU hotplugging
  blktrace: don't let the sysfs interface remove trace from running list
  blk-mq: Use all available hardware queues
  blk-mq: Micro-optimize bt_get()
  blk-mq: Fix a race between bt_clear_tag() and bt_get()
  blk-mq: Avoid that __bt_get_word() wraps multiple times
  blk-mq: Fix a use-after-free
  blk-mq: prevent unmapped hw queue from being scheduled
  blk-mq: re-check for available tags after running the hardware queue
  blk-mq: fix hang in bt_get()
  blk-mq: move the kdump check to blk_mq_alloc_tag_set
  blk-mq: cleanup tag free handling
  blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
  blk: introduce generic io stat accounting help function
  blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
  genhd: check for int overflow in disk_expand_part_tbl()
  blk-mq: add blk_mq_free_hctx_request()
  blk-mq: export blk_mq_free_request()
  blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
  ...
2014-12-13 14:14:23 -08:00
Linus Torvalds
8f4385d590 Merge tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixlet from Steven Rostedt:
 "Remove unnecessary preempt_disable in printk()"

* tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  printk: Do not disable preemption for accessing printk_func
2014-12-13 14:04:41 -08:00
Linus Torvalds
a99abce2d9 Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "Two small patches from the audit next branch; only one of which has
  any real significant code changes, the other is simply a MAINTAINERS
  update for audit.

  The single code patch is pretty small and rather straightforward, it
  changes the audit "version" number reported to userspace from an
  integer to a bitmap which is used to indicate the functionality of the
  running kernel.  This really doesn't have much impact on the kernel,
  but it will make life easier for the audit userspace folks.

  Thankfully we were still on a version number which allowed us to do
  this without breaking userspace"

* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
  audit: convert status version to a feature bitmap
  audit: add Paul Moore to the MAINTAINERS entry
2014-12-13 13:41:28 -08:00