powerpc/eeh: Trace time on first error for PE

We're not expecting that one specific PE got frozen for over 5
times in last hour. Otherwise, the PE will be removed from the
system upon newly coming EEH errors. The patch introduces time
stamp to trace the first error on specific PE in last hour and
function to update that accordingly. Besides, the time stamp
is recovered during PE hotplug path as we did for frozen count.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This commit is contained in:
Gavin Shan
2013-06-20 13:21:01 +08:00
committed by Benjamin Herrenschmidt
parent c86085580d
commit 5a71978e4b
3 changed files with 35 additions and 0 deletions

View File

@@ -349,10 +349,12 @@ static void *eeh_report_failure(void *data, void *userdata)
*/
static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus)
{
struct timeval tstamp;
int cnt, rc;
/* pcibios will clear the counter; save the value */
cnt = pe->freeze_count;
tstamp = pe->tstamp;
/*
* We don't remove the corresponding PE instances because
@@ -385,6 +387,8 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus)
ssleep(5);
pcibios_add_pci_devices(bus);
}
pe->tstamp = tstamp;
pe->freeze_count = cnt;
return 0;
@@ -425,6 +429,7 @@ void eeh_handle_event(struct eeh_pe *pe)
return;
}
eeh_pe_update_time_stamp(pe);
pe->freeze_count++;
if (pe->freeze_count > EEH_MAX_ALLOWED_FREEZES)
goto excess_failures;