Commit Graph

15131 Commits

Author SHA1 Message Date
Linus Torvalds
4c64616bb5 Merge branch 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/debug changes from Ingo Molnar.

* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Fix section warnings
  x86-64: Fix CFI data for common_interrupt()
  x86: Properly _init-annotate NMI selftest code
  x86/debug: Fix/improve the show_msr=<cpus> debug print out
2012-03-22 09:30:39 -07:00
Linus Torvalds
c5c7fb8fbd Merge branches 'x86-cpu-for-linus', 'x86-boot-for-linus', 'x86-cpufeature-for-linus', 'x86-process-for-linus' and 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull trivial x86 branches from Ingo Molnar: small one-liners to fix up
details.

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Remove some noise from boot log when starting cpus

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, boot: Fix port argument to inl() function

* 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, cpufeature: Add CPU features from Intel document 319433-012A

* 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86_64: Record stack pointer before task execution begins

* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/UV: Lower UV rtc clocksource rating
2012-03-22 09:28:15 -07:00
Linus Torvalds
e17fdf5c67 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/asm changes from Ingo Molnar

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Include probe_roms.h in probe_roms.c
  x86/32: Print control and debug registers for kerenel context
  x86: Tighten dependencies of CPU_SUP_*_32
  x86/numa: Improve internode cache alignment
  x86: Fix the NMI nesting comments
  x86-64: Improve insn scheduling in SAVE_ARGS_IRQ
  x86-64: Fix CFI annotations for NMI nesting code
  bitops: Add missing parentheses to new get_order macro
  bitops: Optimise get_order()
  bitops: Adjust the comment on get_order() to describe the size==0 case
  x86/spinlocks: Eliminate TICKET_MASK
  x86-64: Handle byte-wise tail copying in memcpy() without a loop
  x86-64: Fix memcpy() to support sizes of 4Gb and above
  x86-64: Fix memset() to support sizes of 4Gb and above
  x86-64: Slightly shorten copy_page()
2012-03-22 09:13:24 -07:00
Len Brown
e840dfe334 Merge branch 'stable/for-x86-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into tboot 2012-03-22 01:31:09 -04:00
Xiao Guangrong
b716ad953a mm: search from free_area_cache for the bigger size
If the required size is bigger than cached_hole_size it is better to
search from free_area_cache - it is easier to get a free region,
specifically for the 64 bit process whose address space is large enough

Do it just as hugetlb_get_unmapped_area_topdown() in arch/x86/mm/hugetlbpage.c

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:56 -07:00
Andrea Arcangeli
1a5a9906d4 mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode.  In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.

It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds).  The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().

Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously.  This is
probably why it wasn't common to run into this.  For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.

Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).

The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value.  Even if the real pmd is changing under the
value we hold on the stack, we don't care.  If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).

All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd.  The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds).  I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).

		if (pmd_trans_huge(*pmd)) {
			if (next-addr != HPAGE_PMD_SIZE) {
				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
				split_huge_page_pmd(vma->vm_mm, pmd);
			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
				continue;
			/* fall through */
		}
		if (pmd_none_or_clear_bad(pmd))

Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.

The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.

====== start quote =======
      mapcount 0 page_mapcount 1
      kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

      mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
    ->  145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

               virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
      on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
             madvise_dontneed
               zap_page_range
                 unmap_vmas
                   unmap_page_range
                     zap_pud_range
                       zap_pmd_range
                         //
                         // Assume that this huge page has never been accessed.
                         // I.e. content of the PMD entry is zero (not mapped).
                         //
                         if (pmd_trans_huge(*pmd)) {
                             // We don't get here due to the above assumption.
                         }
                         //
                         // Assume that Thread B incurred a page fault and
             .---------> // sneaks in here as shown below.
             |           //
             |           if (pmd_none_or_clear_bad(pmd))
             |               {
             |                 if (unlikely(pmd_bad(*pmd)))
             |                     pmd_clear_bad
             |                     {
             |                       pmd_ERROR
             |                         // Log "bad pmd ..." message here.
             |                       pmd_clear
             |                         // Clear the page's PMD entry.
             |                         // Thread B incremented the map count
             |                         // in page_add_new_anon_rmap(), but
             |                         // now the page is no longer mapped
             |                         // by a PMD entry (-> inconsistency).
             |                     }
             |               }
             |
             v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
      in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
              // We get here due to the above assumption (PMD entry is zero).
              do_huge_pmd_anonymous_page
                alloc_hugepage_vma
                  // Allocate a new transparent huge page here.
                ...
                __do_huge_pmd_anonymous_page
                  ...
                  spin_lock(&mm->page_table_lock)
                  ...
                  page_add_new_anon_rmap
                    // Here we increment the page's map count (starts at -1).
                    atomic_set(&page->_mapcount, 0)
                  set_pmd_at
                    // Here we set the page's PMD entry which will be cleared
                    // when Thread A calls pmd_clear_bad().
                  ...
                  spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read).  Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated.  However, Thread A
    does not synchronize on that lock.

====== end quote =======

[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>		[2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:54 -07:00
Linus Torvalds
c207f3a431 Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull irq_domain support for all architectures from Grant Likely:
 "Generialize powerpc's irq_host as irq_domain

  This branch takes the PowerPC irq_host infrastructure (reverse mapping
  from Linux IRQ numbers to hardware irq numbering), generalizes it,
  renames it to irq_domain, and makes it available to all architectures.

  Originally the plan has been to create an all-new irq_domain
  implementation which addresses some of the powerpc shortcomings such
  as not handling 1:1 mappings well, but doing that proved to be far
  more difficult and invasive than generalizing the working code and
  refactoring it in-place.  So, this branch rips out the 'new'
  irq_domain and replaces it with the modified powerpc version (in a
  fully bisectable way of course).  It converts all users over to the
  new API and makes irq_domain selectable on any architecture.

  No architecture is forced to enable irq_domain, but the infrastructure
  is required for doing OpenFirmware style irq translations.  It will
  even work on SPARC even though SPARC has it's own mechanism for
  translating irqs at boot time.  MIPS, microblaze, embedded x86 and c6x
  are converted too.

  The resulting irq_domain code is probably still too verbose and can be
  optimized more, but that can be done incrementally and is a task for
  follow-on patches."

* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: (31 commits)
  dt: fix twl4030 for non-dt compile on x86
  mfd: twl-core: Add IRQ_DOMAIN dependency
  devicetree: Add empty of_platform_populate() for !CONFIG_OF_ADDRESS (sparc)
  irq_domain: Centralize definition of irq_dispose_mapping()
  irq_domain/mips: Allow irq_domain on MIPS
  irq_domain/x86: Convert x86 (embedded) to use common irq_domain
  ppc-6xx: fix build failure in flipper-pic.c and hlwd-pic.c
  irq_domain/microblaze: Convert microblaze to use irq_domains
  irq_domain/powerpc: Replace custom xlate functions with library functions
  irq_domain/powerpc: constify irq_domain_ops
  irq_domain/c6x: Use library of xlate functions
  irq_domain/c6x: constify irq_domain structures
  irq_domain/c6x: Convert c6x to use generic irq_domain support.
  irq_domain: constify irq_domain_ops
  irq_domain: Create common xlate functions that device drivers can use
  irq_domain: Remove irq_domain_add_simple()
  irq_domain: Remove 'new' irq_domain in favour of the ppc one
  mfd: twl-core.c: Fix the number of interrupts managed by twl4030
  of/address: add empty static inlines for !CONFIG_OF
  irq_domain: Add support for base irq and hwirq in legacy mappings
  ...
2012-03-21 10:27:19 -07:00
Linus Torvalds
c7c66c0cb0 Merge tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates for 3.4 from Rafael Wysocki:
 "Assorted extensions and fixes including:

  * Introduction of early/late suspend/hibernation device callbacks.
  * Generic PM domains extensions and fixes.
  * devfreq updates from Axel Lin and MyungJoo Ham.
  * Device PM QoS updates.
  * Fixes of concurrency problems with wakeup sources.
  * System suspend and hibernation fixes."

* tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits)
  PM / Domains: Check domain status during hibernation restore of devices
  PM / devfreq: add relation of recommended frequency.
  PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on()
  PM / shmobile: Make CMT driver use pm_genpd_dev_always_on()
  PM / shmobile: Make TMU driver use pm_genpd_dev_always_on()
  PM / Domains: Introduce "always on" device flag
  PM / Domains: Fix hibernation restore of devices, v2
  PM / Domains: Fix handling of wakeup devices during system resume
  sh_mmcif / PM: Use PM QoS latency constraint
  tmio_mmc / PM: Use PM QoS latency constraint
  PM / QoS: Make it possible to expose PM QoS latency constraints
  PM / Sleep: JBD and JBD2 missing set_freezable()
  PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case
  PM / Freezer: Remove references to TIF_FREEZE in comments
  PM / Sleep: Add more wakeup source initialization routines
  PM / Hibernate: Enable usermodehelpers in hibernate() error path
  PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
  PM / Sleep: Fix race conditions related to wakeup source timer function
  PM / Sleep: Fix possible infinite loop during wakeup source destruction
  PM / Hibernate: print physical addresses consistently with other parts of kernel
  ...
2012-03-21 10:15:51 -07:00
Linus Torvalds
9f3938346a Merge branch 'kmap_atomic' of git://github.com/congwang/linux
Pull kmap_atomic cleanup from Cong Wang.

It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().

Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.

* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
  feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
  highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
  drbd: remove the second argument of k[un]map_atomic()
  zcache: remove the second argument of k[un]map_atomic()
  gma500: remove the second argument of k[un]map_atomic()
  dm: remove the second argument of k[un]map_atomic()
  tomoyo: remove the second argument of k[un]map_atomic()
  sunrpc: remove the second argument of k[un]map_atomic()
  rds: remove the second argument of k[un]map_atomic()
  net: remove the second argument of k[un]map_atomic()
  mm: remove the second argument of k[un]map_atomic()
  lib: remove the second argument of k[un]map_atomic()
  power: remove the second argument of k[un]map_atomic()
  kdb: remove the second argument of k[un]map_atomic()
  udf: remove the second argument of k[un]map_atomic()
  ubifs: remove the second argument of k[un]map_atomic()
  squashfs: remove the second argument of k[un]map_atomic()
  reiserfs: remove the second argument of k[un]map_atomic()
  ocfs2: remove the second argument of k[un]map_atomic()
  ntfs: remove the second argument of k[un]map_atomic()
  ...
2012-03-21 09:40:26 -07:00
Linus Torvalds
4a52246302 Merge tag 'driver-core-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches for 3.4-rc1 from Greg KH:
 "Here's the big driver core merge for 3.4-rc1.

  Lots of various things here, sysfs fixes/tweaks (with the nlink
  breakage reverted), dynamic debugging updates, w1 drivers, hyperv
  driver updates, and a variety of other bits and pieces, full
  information in the shortlog."

* tag 'driver-core-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (78 commits)
  Tools: hv: Support enumeration from all the pools
  Tools: hv: Fully support the new KVP verbs in the user level daemon
  Drivers: hv: Support the newly introduced KVP messages in the driver
  Drivers: hv: Add new message types to enhance KVP
  regulator: Support driver probe deferral
  Revert "sysfs: Kill nlink counting."
  uevent: send events in correct order according to seqnum (v3)
  driver core: minor comment formatting cleanups
  driver core: move the deferred probe pointer into the private area
  drivercore: Add driver probe deferral mechanism
  DS2781 Maxim Stand-Alone Fuel Gauge battery and w1 slave drivers
  w1_bq27000: Only one thread can access the bq27000 at a time.
  w1_bq27000 - remove w1_bq27000_write
  w1_bq27000: remove unnecessary NULL test.
  sysfs: Fix memory leak in sysfs_sd_setsecdata().
  intel_idle: Revert change of auto_demotion_disable_flags for Nehalem
  w1: Fix w1_bq27000
  driver-core: documentation: fix up Greg's email address
  powernow-k6: Really enable auto-loading
  powernow-k7: Fix CPU family number
  ...
2012-03-20 11:16:20 -07:00
Linus Torvalds
161f7a7161 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer changes for v3.4 from Ingo Molnar

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  ntp: Fix integer overflow when setting time
  math: Introduce div64_long
  cs5535-clockevt: Allow the MFGPT IRQ to be shared
  cs5535-clockevt: Don't ignore MFGPT on SMP-capable kernels
  x86/time: Eliminate unused irq0_irqs counter
  clocksource: scx200_hrt: Fix the build
  x86/tsc: Reduce the TSC sync check time for core-siblings
  timer: Fix bad idle check on irq entry
  nohz: Remove ts->Einidle checks before restarting the tick
  nohz: Remove update_ts_time_stat from tick_nohz_start_idle
  clockevents: Leave the broadcast device in shutdown mode when not needed
  clocksource: Load the ACPI PM clocksource asynchronously
  clocksource: scx200_hrt: Convert scx200 to use clocksource_register_hz
  clocksource: Get rid of clocksource_calc_mult_shift()
  clocksource: dbx500: convert to clocksource_register_hz()
  clocksource: scx200_hrt:  use pr_<level> instead of printk
  time: Move common updates to a function
  time: Reorder so the hot data is together
  time: Remove most of xtime_lock usage in timekeeping.c
  ntp: Add ntp_lock to replace xtime_locking
  ...
2012-03-20 10:32:09 -07:00
Linus Torvalds
2ba68940c8 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes for v3.4 from Ingo Molnar

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  printk: Make it compile with !CONFIG_PRINTK
  sched/x86: Fix overflow in cyc2ns_offset
  sched: Fix nohz load accounting -- again!
  sched: Update yield() docs
  printk/sched: Introduce special printk_sched() for those awkward moments
  sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
  sched: Cleanup cpu_active madness
  sched: Fix load-balance wreckage
  sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
  sched: Ditch per cgroup task lists for load-balancing
  sched: Rename load-balancing fields
  sched: Move load-balancing arguments into helper struct
  sched/rt: Do not submit new work when PI-blocked
  sched/rt: Prevent idle task boosting
  sched/wait: Add __wake_up_all_locked() API
  sched/rt: Document scheduler related skip-resched-check sites
  sched/rt: Use schedule_preempt_disabled()
  sched/rt: Add schedule_preempt_disabled()
  sched/rt: Do not throttle when PI boosting
  sched/rt: Keep period timer ticking when rt throttling is active
  ...
2012-03-20 10:31:44 -07:00
Linus Torvalds
9c2b957db1 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events changes for v3.4 from Ingo Molnar:

 - New "hardware based branch profiling" feature both on the kernel and
   the tooling side, on CPUs that support it.  (modern x86 Intel CPUs
   with the 'LBR' hardware feature currently.)

   This new feature is basically a sophisticated 'magnifying glass' for
   branch execution - something that is pretty difficult to extract from
   regular, function histogram centric profiles.

   The simplest mode is activated via 'perf record -b', and the result
   looks like this in perf report:

	$ perf record -b any_call,u -e cycles:u branchy

	$ perf report -b --sort=symbol
	    52.34%  [.] main                   [.] f1
	    24.04%  [.] f1                     [.] f3
	    23.60%  [.] f1                     [.] f2
	     0.01%  [k] _IO_new_file_xsputn    [k] _IO_file_overflow
	     0.01%  [k] _IO_vfprintf_internal  [k] _IO_new_file_xsputn
	     0.01%  [k] _IO_vfprintf_internal  [k] strchrnul
	     0.01%  [k] __printf               [k] _IO_vfprintf_internal
	     0.01%  [k] main                   [k] __printf

   This output shows from/to branch columns and shows the highest
   percentage (from,to) jump combinations - i.e.  the most likely taken
   branches in the system.  "branches" can also include function calls
   and any other synchronous and asynchronous transitions of the
   instruction pointer that are not 'next instruction' - such as system
   calls, traps, interrupts, etc.

   This feature comes with (hopefully intuitive) flat ascii and TUI
   support in perf report.

 - Various 'perf annotate' visual improvements for us assembly junkies.
   It will now recognize function calls in the TUI and by hitting enter
   you can follow the call (recursively) and back, amongst other
   improvements.

 - Multiple threads/processes recording support in perf record, perf
   stat, perf top - which is activated via a comma-list of PIDs:

	perf top -p 21483,21485
	perf stat -p 21483,21485 -ddd
	perf record -p 21483,21485

 - Support for per UID views, via the --uid paramter to perf top, perf
   report, etc.  For example 'perf top --uid mingo' will only show the
   tasks that I am running, excluding other users, root, etc.

 - Jump label restructurings and improvements - this includes the
   factoring out of the (hopefully much clearer) include/linux/static_key.h
   generic facility:

	struct static_key key = STATIC_KEY_INIT_FALSE;

	...

	if (static_key_false(&key))
	        do unlikely code
	else
	        do likely code

	...
	static_key_slow_inc();
	...
	static_key_slow_inc();
	...

   The static_key_false() branch will be generated into the code with as
   little impact to the likely code path as possible.  the
   static_key_slow_*() APIs flip the branch via live kernel code patching.

   This facility can now be used more widely within the kernel to
   micro-optimize hot branches whose likelihood matches the static-key
   usage and fast/slow cost patterns.

 - SW function tracer improvements: perf support and filtering support.

 - Various hardenings of the perf.data ABI, to make older perf.data's
   smoother on newer tool versions, to make new features integrate more
   smoothly, to support cross-endian recording/analyzing workflows
   better, etc.

 - Restructuring of the kprobes code, the splitting out of 'optprobes',
   and a corner case bugfix.

 - Allow the tracing of kernel console output (printk).

 - Improvements/fixes to user-space RDPMC support, allowing user-space
   self-profiling code to extract PMU counts without performing any
   system calls, while playing nice with the kernel side.

 - 'perf bench' improvements

 - ... and lots of internal restructurings, cleanups and fixes that made
   these features possible.  And, as usual this list is incomplete as
   there were also lots of other improvements

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
  perf report: Fix annotate double quit issue in branch view mode
  perf report: Remove duplicate annotate choice in branch view mode
  perf/x86: Prettify pmu config literals
  perf report: Enable TUI in branch view mode
  perf report: Auto-detect branch stack sampling mode
  perf record: Add HEADER_BRANCH_STACK tag
  perf record: Provide default branch stack sampling mode option
  perf tools: Make perf able to read files from older ABIs
  perf tools: Fix ABI compatibility bug in print_event_desc()
  perf tools: Enable reading of perf.data files from different ABI rev
  perf: Add ABI reference sizes
  perf report: Add support for taken branch sampling
  perf record: Add support for sampling taken branch
  perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
  x86/kprobes: Split out optprobe related code to kprobes-opt.c
  x86/kprobes: Fix a bug which can modify kernel code permanently
  x86/kprobes: Fix instruction recovery on optimized path
  perf: Add callback to flush branch_stack on context switch
  perf: Disable PERF_SAMPLE_BRANCH_* when not supported
  perf/x86: Add LBR software filter support for Intel CPUs
  ...
2012-03-20 10:29:15 -07:00
Linus Torvalds
0bbfcaff9b Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq/core changes for v3.4 from Ingo Molnar

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Remove paranoid warnons and bogus fixups
  genirq: Flush the irq thread on synchronization
  genirq: Get rid of unnecessary IRQTF_DIED flag
  genirq: No need to check IRQTF_DIED before stopping a thread handler
  genirq: Get rid of unnecessary irqaction field in task_struct
  genirq: Fix incorrect check for forced IRQ thread handler
  softirq: Reduce invoke_softirq() code duplication
  genirq: Fix long-term regression in genirq irq_set_irq_type() handling
  x86-32/irq: Don't switch to irq stack for a user-mode irq
2012-03-20 10:28:56 -07:00
Cong Wang
8fd75e1216 x86: remove the second argument of k[un]map_atomic()
Acked-by: Avi Kivity <avi@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:15 +08:00
Marcelo Tosatti
b74f05d61b x86: kvmclock: abstract save/restore sched_clock_state
Upon resume from hibernation, CPU 0's hvclock area contains the old
values for system_time and tsc_timestamp. It is necessary for the
hypervisor to update these values with uptodate ones before the CPU uses
them.

Abstract TSC's save/restore sched_clock_state functions and use
restore_state to write to KVM_SYSTEM_TIME MSR, forcing an update.

Also move restore_sched_clock_state before __restore_processor_state,
since the later calls CONFIG_LOCK_STAT's lockstat_clock (also for TSC).
Thanks to Igor Mammedov for tracking it down.

Fixes suspend-to-disk with kvmclock.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-20 12:37:45 +02:00
Steffen Persvold
943bc7e110 x86: Fix section warnings
Fix the following section warnings :

WARNING: vmlinux.o(.text+0x49dbc): Section mismatch in reference
from the function acpi_map_cpu2node() to the variable
.cpuinit.data:__apicid_to_node The function acpi_map_cpu2node()
references the variable __cpuinitdata __apicid_to_node. This is
often because acpi_map_cpu2node lacks a __cpuinitdata
annotation or the annotation of __apicid_to_node is wrong.

WARNING: vmlinux.o(.text+0x49dc1): Section mismatch in reference
from the function acpi_map_cpu2node() to the function
.cpuinit.text:numa_set_node() The function acpi_map_cpu2node()
references the function __cpuinit numa_set_node(). This is often
because acpi_map_cpu2node lacks a __cpuinit  annotation or the
annotation of numa_set_node is wrong.

WARNING: vmlinux.o(.text+0x526e77): Section mismatch in
reference from the function prealloc_protection_domains() to the
function .init.text:alloc_passthrough_domain() The function
prealloc_protection_domains() references the function __init
alloc_passthrough_domain(). This is often because
prealloc_protection_domains lacks a __init  annotation or the annotation of alloc_passthrough_domain is wrong.

Signed-off-by: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1331810188-24785-1-git-send-email-sp@numascale.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-19 12:01:01 +01:00
Jiri Olsa
641cc93881 perf: Adding sysfs group format attribute for pmu device
Adding sysfs group 'format' attribute for pmu device that
contains a syntax description on how to construct raw events.

The event configuration is described in following
struct pefr_event_attr attributes:

  config
  config1
  config2

Each sysfs attribute within the format attribute group,
describes mapping of name and bitfield definition within
one of above attributes.

eg:
  "/sys/...<dev>/format/event" contains "config:0-7"
  "/sys/...<dev>/format/umask" contains "config:8-15"
  "/sys/...<dev>/format/usr"   contains "config:16"

the attribute value syntax is:

  line:      config ':' bits
  config:    'config' | 'config1' | 'config2"
  bits:      bits ',' bit_term | bit_term
  bit_term:  VALUE '-' VALUE | VALUE

Adding format attribute definitions for x86 cpu pmus.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/n/tip-vhdk5y2hyype9j63prymty36@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2012-03-16 14:06:06 -03:00
Alok Kataria
57779dc2b3 x86, tsc: Skip refined tsc calibration on systems with reliable TSC
While running the latest Linux as guest under VMware in highly
over-committed situations, we have seen cases when the refined TSC
algorithm fails to get a valid tsc_start value in
tsc_refine_calibration_work from multiple attempts. As a result the
kernel keeps on scheduling the tsc_irqwork task for later. Subsequently
after several attempts when it gets a valid start value it goes through
the refined calibration and either bails out or uses the new results.
Given that the kernel originally read the TSC frequency from the
platform, which is the best it can get, I don't think there is much
value in refining it.

So  for systems which get the TSC frequency from the platform we
should skip the refined tsc algorithm.

We can use the TSC_RELIABLE cpu cap flag to detect this, right now it is
set only on VMware and for Moorestown Penwell both of which have there
own TSC calibration methods.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Dirk Brandewie <dirk.brandewie@gmail.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: stable@kernel.org
[jstultz: Reworked to simply not schedule the refining work,
rather then scheduling the work and bombing out later]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-03-15 18:23:11 -07:00
Thomas Gleixner
2ab516575f x86: vdso: Use seqcount instead of seqlock
The update of the vdso data happens under xtime_lock, so adding a
nested lock is pointless. Just use a seqcount to sync the readers.

Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-03-15 18:17:58 -07:00
Thomas Gleixner
6c260d5863 x86: vdso: Remove bogus locking in update_vsyscall_tz()
Changing the sequence count in update_vsyscall_tz() is completely
pointless.

The vdso code copies the data unprotected. There is no point to change
this as sys_tz is nowhere protected at all. See sys_gettimeofday().

Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-03-15 18:17:57 -07:00
Daniel J Blueman
fa63030e9c x86/platform: Move APIC ID validity check into platform APIC code
Move APIC ID validity check into platform APIC code, so it can
be overridden when needed. For NumaChip systems, always trust
MADT, as it's constructed with high APIC IDs.

Behaviour verifies on standard x86 systems and on NumaChip
systems with this, and compile-tested with allyesconfig.

Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Reviewed-by: Steffen Persvold <sp@numascale.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1331709454-27966-1-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-14 09:49:48 +01:00
Ingo Molnar
ea281a9eba Merge tag 'mce-for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce
Apply two miscellaneous MCE fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-14 07:44:48 +01:00
Ingo Molnar
cd593accdc Merge tag 'v3.3-rc7' into x86/mce
Merge reason: Update from an ancient -rc1 base to an almost-final stable kernel.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-14 07:44:11 +01:00
Srikar Dronamraju
0326f5a94d uprobes/core: Handle breakpoint and singlestep exceptions
Uprobes uses exception notifiers to get to know if a thread hit
a breakpoint or a singlestep exception.

When a thread hits a uprobe or is singlestepping post a uprobe
hit, the uprobe exception notifier sets its TIF_UPROBE bit,
which will then be checked on its return to userspace path
(do_notify_resume() ->uprobe_notify_resume()), where the
consumers handlers are run (in task context) based on the
defined filters.

Uprobe hits are thread specific and hence we need to maintain
information about if a task hit a uprobe, what uprobe was hit,
the slot where the original instruction was copied for xol so
that it can be singlestepped with appropriate fixups.

In some cases, special care is needed for instructions that are
executed out of line (xol). These are architecture specific
artefacts, such as handling RIP relative instructions on x86_64.

Since the instruction at which the uprobe was inserted is
executed out of line, architecture specific fixups are added so
that the thread continues normal execution in the presence of a
uprobe.

Postpone the signals until we execute the probed insn.
post_xol() path does a recalc_sigpending() before return to
user-mode, this ensures the signal can't be lost.

Uprobes relies on DIE_DEBUG notification to notify if a
singlestep is complete.

Adds x86 specific uprobe exception notifiers and appropriate
hooks needed to determine a uprobe hit and subsequent post
processing.

Add requisite x86 fixups for xol for uprobes. Specific cases
needing fixups include relative jumps (x86_64), calls, etc.

Where possible, we check and skip singlestepping the
breakpointed instructions. For now we skip single byte as well
as few multibyte nop instructions. However this can be extended
to other instructions too.

Credits to Oleg Nesterov for suggestions/patches related to
signal, breakpoint, singlestep handling code.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com
[ Performed various cleanliness edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-14 07:41:36 +01:00
Konrad Rzeszutek Wilk
a1f37788a6 tboot: Add return values for tboot_sleep
.. as appropiately. As tboot_sleep now returns values.
remove tboot_sleep_wrapper.

Suggested-and-Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Joseph Cihula <joseph.cihula@intel.com>
[v1: Return -1/0/+1 instead of ACPI_xx values]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-13 14:06:55 -04:00
Tang Liang
09f98a825a x86, acpi, tboot: Have a ACPI os prepare sleep instead of calling tboot_sleep.
The ACPI suspend path makes a call to tboot_sleep right before
it writes the PM1A, PM1B values. We replace the direct call to
tboot via an registration callback similar to __acpi_register_gsi.

CC: Len Brown <len.brown@intel.com>
Acked-by: Joseph Cihula <joseph.cihula@intel.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
[v1: Added __attribute__ ((unused))]
[v2: Introduced a wrapper instead of changing tboot_sleep return values]
[v3: Added return value AE_CTRL_SKIP for acpi_os_sleep_prepare]
Signed-off-by: Tang Liang <liang.tang@oracle.com>
[v1: Fix compile issues on IA64 and PPC64]
[v2: Fix where __acpi_os_prepare_sleep==NULL and did not go in sleep properly]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-13 14:06:33 -04:00
Thomas Gleixner
df8d291f28 Merge branch 'linus' into irq/core
Reason: Get upstream fixes integrated before further modifications.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-13 16:35:16 +01:00
Ingo Molnar
ef15eda982 Merge branch 'x86/cleanups' into perf/uprobes
Merge reason: We want to merge a dependent patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-13 16:33:03 +01:00
Salman Qazi
9993bc635d sched/x86: Fix overflow in cyc2ns_offset
When a machine boots up, the TSC generally gets reset.  However,
when kexec is used to boot into a kernel, the TSC value would be
carried over from the previous kernel.  The computation of
cycns_offset in set_cyc2ns_scale is prone to an overflow, if the
machine has been up more than 208 days prior to the kexec.  The
overflow happens when we multiply *scale, even though there is
enough room to store the final answer.

We fix this issue by decomposing tsc_now into the quotient and
remainder of division by CYC2NS_SCALE_FACTOR and then performing
the multiplication separately on the two components.

Refactor code to share the calculation with the previous
fix in __cycles_2_ns().

Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: john stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-13 16:27:51 +01:00
Ingo Molnar
47258cf3c4 Merge tag 'v3.3-rc7' into sched/core
Merge reason: merge back final fixes, prepare for the merge window.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-13 16:26:52 +01:00
Srikar Dronamraju
51e7dc7011 x86: Rename trap_no to trap_nr in thread_struct
There are precedences of trap number being referred to as
trap_nr. However thread struct refers trap number as trap_no.
Change it to trap_nr.

Also use enum instead of left-over literals for trap values.

This is pure cleanup, no functional change intended.

Suggested-by: Ingo Molnar <mingo@eltu.hu>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120312092555.5379.942.sendpatchset@srdronam.in.ibm.com
[ Fixed the math-emu build ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-13 06:24:09 +01:00
Srikar Dronamraju
e3343e6a28 uprobes/core: Make order of function parameters consistent across functions
If a function takes struct uprobe or struct arch_uprobe, then it
is passed as the first parameter.

This is pure cleanup, no functional change intended.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120312092530.5379.18394.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-13 06:22:20 +01:00
Srikar Dronamraju
900771a483 uprobes/core: Make macro names consistent
Rename macros that refer to individual uprobe to start with
UPROBE_ instead of UPROBES_.

This is pure cleanup, no functional change intended.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120312092514.5379.36595.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-13 06:22:20 +01:00
Ingo Molnar
e898c67068 Merge branch 'x86/x32' into x86/cleanups
Merge reason: We are going to merge a dependent patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-13 05:54:41 +01:00
Suresh Siddha
73d63d038e x86/ioapic: Add register level checks to detect bogus io-apic entries
With the recent changes to clear_IO_APIC_pin() which tries to
clear remoteIRR bit explicitly, some of the users started to see
"Unable to reset IRR for apic .." messages.

Close look shows that these are related to bogus IO-APIC entries
which return's all 1's for their io-apic registers. And the
above mentioned error messages are benign. But kernel should
have ignored such io-apic's in the first place.

Check if register 0, 1, 2 of the listed io-apic are all 1's and
ignore such io-apic.

Reported-by: Álvaro Castillo <midgoon@gmail.com>
Tested-by: Jon Dufresne <jon@jondufresne.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yinghai@kernel.org
Cc: kernel-team@fedoraproject.org
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1331577393.31585.94.camel@sbsiddha-desk.sc.intel.com
[ Performed minor cleanup of affected code. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-13 05:52:02 +01:00
Ingo Molnar
bea95c152d Merge branch 'perf/hw-branch-sampling' into perf/core
Merge reason: The 'perf record -b' hardware branch sampling feature is ready for upstream.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12 20:47:05 +01:00
Peter Zijlstra
f9b4eeb809 perf/x86: Prettify pmu config literals
I got somewhat tired of having to decode hex numbers..

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Robert Richter <robert.richter@amd.com>
Link: http://lkml.kernel.org/n/tip-0vsy1sgywc4uar3mu1szm0rg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12 20:44:54 +01:00
Ingo Molnar
35239e23c6 Merge branch 'perf/urgent' into perf/core
Merge reason: We are going to queue up a dependent patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12 20:44:11 +01:00
Peter Zijlstra
87e24f4b67 perf/x86: Fix local vs remote memory events for NHM/WSM
Verified using the below proglet.. before:

[root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 0
remote write

 Performance counter stats for './numa 0':

         2,101,554 node-stores
         2,096,931 node-store-misses

       5.021546079 seconds time elapsed

[root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 1
local write

 Performance counter stats for './numa 1':

           501,137 node-stores
               199 node-store-misses

       5.124451068 seconds time elapsed

After:

[root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 0
remote write

 Performance counter stats for './numa 0':

         2,107,516 node-stores
         2,097,187 node-store-misses

       5.012755149 seconds time elapsed

[root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 1
local write

 Performance counter stats for './numa 1':

         2,063,355 node-stores
               165 node-store-misses

       5.082091494 seconds time elapsed

#define _GNU_SOURCE

#include <sched.h>
#include <stdio.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <dirent.h>
#include <signal.h>
#include <unistd.h>
#include <numaif.h>
#include <stdlib.h>

#define SIZE (32*1024*1024)

volatile int done;

void sig_done(int sig)
{
	done = 1;
}

int main(int argc, char **argv)
{
	cpu_set_t *mask, *mask2;
	size_t size;
	int i, err, t;
	int nrcpus = 1024;
	char *mem;
	unsigned long nodemask = 0x01; /* node 0 */
	DIR *node;
	struct dirent *de;
	int read = 0;
	int local = 0;

	if (argc < 2) {
		printf("usage: %s [0-3]\n", argv[0]);
		printf("  bit0 - local/remote\n");
		printf("  bit1 - read/write\n");
		exit(0);
	}

	switch (atoi(argv[1])) {
	case 0:
		printf("remote write\n");
		break;
	case 1:
		printf("local write\n");
		local = 1;
		break;
	case 2:
		printf("remote read\n");
		read = 1;
		break;
	case 3:
		printf("local read\n");
		local = 1;
		read = 1;
		break;
	}

	mask = CPU_ALLOC(nrcpus);
	size = CPU_ALLOC_SIZE(nrcpus);
	CPU_ZERO_S(size, mask);

	node = opendir("/sys/devices/system/node/node0/");
	if (!node)
		perror("opendir");
	while ((de = readdir(node))) {
		int cpu;

		if (sscanf(de->d_name, "cpu%d", &cpu) == 1)
			CPU_SET_S(cpu, size, mask);
	}
	closedir(node);

	mask2 = CPU_ALLOC(nrcpus);
	CPU_ZERO_S(size, mask2);
	for (i = 0; i < size; i++)
		CPU_SET_S(i, size, mask2);
	CPU_XOR_S(size, mask2, mask2, mask); // invert

	if (!local)
		mask = mask2;

	err = sched_setaffinity(0, size, mask);
	if (err)
		perror("sched_setaffinity");

	mem = mmap(0, SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	err = mbind(mem, SIZE, MPOL_BIND, &nodemask, 8*sizeof(nodemask), MPOL_MF_MOVE);
	if (err)
		perror("mbind");

	signal(SIGALRM, sig_done);
	alarm(5);

	if (!read) {
		while (!done) {
			for (i = 0; i < SIZE; i++)
				mem[i] = 0x01;
		}
	} else {
		while (!done) {
			for (i = 0; i < SIZE; i++)
				t += *(volatile char *)(mem + i);
		}
	}

	return 0;
}

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/n/tip-tq73sxus35xmqpojf7ootxgs@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12 20:43:41 +01:00
Peter Zijlstra
5fbd036b55 sched: Cleanup cpu_active madness
Stepan found:

CPU0		CPUn

_cpu_up()
  __cpu_up()

		boostrap()
		  notify_cpu_starting()
		  set_cpu_online()
		  while (!cpu_active())
		    cpu_relax()

<PREEMPT-out>

smp_call_function(.wait=1)
  /* we find cpu_online() is true */
  arch_send_call_function_ipi_mask()

  /* wait-forever-more */

<PREEMPT-in>
		  local_irq_enable()

  cpu_notify(CPU_ONLINE)
    sched_cpu_active()
      set_cpu_active()

Now the purpose of cpu_active is mostly with bringing down a cpu, where
we mark it !active to avoid the load-balancer from moving tasks to it
while we tear down the cpu. This is required because we only update the
sched_domain tree after we brought the cpu-down. And this is needed so
that some tasks can still run while we bring it down, we just don't want
new tasks to appear.

On cpu-up however the sched_domain tree doesn't yet include the new cpu,
so its invisible to the load-balancer, regardless of the active state.
So instead of setting the active state after we boot the new cpu (and
consequently having to wait for it before enabling interrupts) set the
cpu active before we set it online and avoid the whole mess.

Reported-by: Stepan Moskovchenko <stepanm@codeaurora.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12 20:43:15 +01:00
Kees Cook
c94082656d x86: Use enum instead of literals for trap values
The traps are referred to by their numbers and it can be difficult to
understand them while reading the code without context. This patch adds
enumeration of the trap numbers and replaces the numbers with the correct
enum for x86.

Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20120310000710.GA32667@www.outflux.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-03-09 16:47:54 -08:00
Greg Kroah-Hartman
263a5c8e16 Merge 3.3-rc6 into driver-core-next
This was done to resolve a conflict in the drivers/base/cpu.c file.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-09 12:35:53 -08:00
Robert Richter
db98c5faf8 perf/x86: Implement 64-bit counter support for IBS
This patch implements 64 bit counter support for IBS. The
sampling period is no longer limited to the hw counter width.

The functions perf_event_set_period() and
perf_event_try_update() can be used as generic functions. They
can replace similar code that is duplicate across architectures.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1323968199-9326-5-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-08 11:35:22 +01:00
Robert Richter
4db2e8e650 perf/x86: Implement IBS pmu control ops
Add code to control the IBS pmu. We need to maintain per-cpu
states. Since some states are used and changed by the nmi
handler, access to these states must be atomic.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1323968199-9326-4-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-08 11:35:22 +01:00
Robert Richter
b7074f1fbd perf/x86: Implement IBS interrupt handler
This patch implements code to handle ibs interrupts. If ibs data
is available a raw perf_event data sample is created and sent
back to the userland. This patch only implements the storage of
ibs data in the raw sample, but this could be extended in a
later patch by generating generic event data such as the rip
from the ibs sampling data.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1323968199-9326-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-08 11:35:21 +01:00
Robert Richter
510419435c perf/x86: Implement IBS event configuration
This patch implements perf configuration for AMD IBS. The IBS
pmu is selected using the type attribute in sysfs. There are two
types of ibs pmus, for instruction fetch (IBS_FETCH) and for
instruction execution (IBS_OP):

 /sys/bus/event_source/devices/ibs_fetch/type
 /sys/bus/event_source/devices/ibs_op/type

Except for the sample period IBS can only be set up with raw
config values and raw data samples. The event attributes for the
syscall should be programmed like this (IBS_FETCH):

        type = get_pmu_type("/sys/bus/event_source/devices/ibs_fetch/type");

        memset(&attr, 0, sizeof(attr));
        attr.type        = type;
        attr.sample_type = PERF_SAMPLE_CPU | PERF_SAMPLE_RAW;
        attr.config      = IBS_FETCH_CONFIG_DEFAULT;

This implementation does not yet support 64 bit counters. It is
limited to the hardware counter bit width which is 20 bits. 64
bit support can be added later.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1323968199-9326-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-08 11:35:21 +01:00
Jan Beulich
a240ada241 x86: Include probe_roms.h in probe_roms.c
... to ensure that declarations and definitions are in sync.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F5888F902000078000770F1@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-08 10:57:35 +01:00
Jan Beulich
c7e23289a6 x86/32: Print control and debug registers for kerenel context
While for a user mode register dump it may be reasonable to skip
those (albeit x86-64 doesn't do so), for kernel mode dumps these
should be printed to make sure all information possibly
necessary for analysis is available.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F58889202000078000770E7@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-08 10:57:35 +01:00
Srivatsa S. Bhat
b11e3d782b x86, mce: Fix rcu splat in drain_mce_log_buffer()
While booting, the following message is seen:

[   21.665087] ===============================
[   21.669439] [ INFO: suspicious RCU usage. ]
[   21.673798] 3.2.0-0.0.0.28.36b5ec9-default #2 Not tainted
[   21.681353] -------------------------------
[   21.685864] arch/x86/kernel/cpu/mcheck/mce.c:194 suspicious rcu_dereference_index_check() usage!
[   21.695013]
[   21.695014] other info that might help us debug this:
[   21.695016]
[   21.703488]
[   21.703489] rcu_scheduler_active = 1, debug_locks = 1
[   21.710426] 3 locks held by modprobe/2139:
[   21.714754]  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff8133afd3>] __driver_attach+0x53/0xa0
[   21.725020]  #1:
[   21.725323] ioatdma: Intel(R) QuickData Technology Driver 4.00
[   21.733206]  (&__lockdep_no_validate__){......}, at: [<ffffffff8133afe1>] __driver_attach+0x61/0xa0
[   21.743015]  #2:  (i7core_edac_lock){+.+.+.}, at: [<ffffffffa01cfa5f>] i7core_probe+0x1f/0x5c0 [i7core_edac]
[   21.753708]
[   21.753709] stack backtrace:
[   21.758429] Pid: 2139, comm: modprobe Not tainted 3.2.0-0.0.0.28.36b5ec9-default #2
[   21.768253] Call Trace:
[   21.770838]  [<ffffffff810977cd>] lockdep_rcu_suspicious+0xcd/0x100
[   21.777366]  [<ffffffff8101aa41>] drain_mcelog_buffer+0x191/0x1b0
[   21.783715]  [<ffffffff8101aa78>] mce_register_decode_chain+0x18/0x20
[   21.790430]  [<ffffffffa01cf8db>] i7core_register_mci+0x2fb/0x3e4 [i7core_edac]
[   21.798003]  [<ffffffffa01cfb14>] i7core_probe+0xd4/0x5c0 [i7core_edac]
[   21.804809]  [<ffffffff8129566b>] local_pci_probe+0x5b/0xe0
[   21.810631]  [<ffffffff812957c9>] __pci_device_probe+0xd9/0xe0
[   21.816650]  [<ffffffff813362e4>] ? get_device+0x14/0x20
[   21.822178]  [<ffffffff81296916>] pci_device_probe+0x36/0x60
[   21.828061]  [<ffffffff8133ac8a>] really_probe+0x7a/0x2b0
[   21.833676]  [<ffffffff8133af23>] driver_probe_device+0x63/0xc0
[   21.839868]  [<ffffffff8133b01b>] __driver_attach+0x9b/0xa0
[   21.845718]  [<ffffffff8133af80>] ? driver_probe_device+0xc0/0xc0
[   21.852027]  [<ffffffff81339168>] bus_for_each_dev+0x68/0x90
[   21.857876]  [<ffffffff8133aa3c>] driver_attach+0x1c/0x20
[   21.863462]  [<ffffffff8133a64d>] bus_add_driver+0x16d/0x2b0
[   21.869377]  [<ffffffff8133b6dc>] driver_register+0x7c/0x160
[   21.875220]  [<ffffffff81296bda>] __pci_register_driver+0x6a/0xf0
[   21.881494]  [<ffffffffa01fe000>] ? 0xffffffffa01fdfff
[   21.886846]  [<ffffffffa01fe047>] i7core_init+0x47/0x1000 [i7core_edac]
[   21.893737]  [<ffffffff810001ce>] do_one_initcall+0x3e/0x180
[   21.899670]  [<ffffffff810a9b95>] sys_init_module+0xc5/0x220
[   21.905542]  [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

Fix this by using ACCESS_ONCE() instead of rcu_dereference_check_mce()
over mcelog.next. Since the access to each entry is controlled by the
->finished field, ACCESS_ONCE() should work just fine. An rcu_dereference
is unnecessary here.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2012-03-07 11:44:29 +01:00