These variables are not used from introduced version, remove them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When a write to one of the devices of a RAID5/6 fails, the failure is
recorded in the metadata of the other devices so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again (maybe a cable was unplugged).
Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.
Currently there is no interlock between the write request completing
and the metadata update. So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.
This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.
So:
- set MD_CHANGE_PENDING when requesting a metadata update for a
failed device, so we can know with certainty when it completes
- queue requests that completed when MD_CHANGE_PENDING is set to
only be processed after the metadata update completes
- call raid_end_bio_io() on bios in that queue when the time comes.
Signed-off-by: NeilBrown <neilb@suse.com>
When a write to one of the legs of a RAID10 fails, the failure is
recorded in the metadata of the other legs so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again (maybe a cable was unplugged).
Currently there is no interlock between the write request completing
and the metadata update. So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.
This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.
So:
- set MD_CHANGE_PENDING when requesting a metadata update for a
failed device, so we can know with certainty when it completes
- queue requests that experienced an error on a new queue which
is only processed after the metadata update completes
- call raid_end_bio_io() on bios in that queue when the time comes.
Signed-off-by: NeilBrown <neilb@suse.com>
When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again (maybe a cable was unplugged).
Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.
Currently there is no interlock between the write request completing
and the metadata update. So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.
This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.
So:
- set MD_CHANGE_PENDING when requesting a metadata update for a
failed device, so we can know with certainty when it completes
- queue requests that experienced an error on a new queue which
is only processed after the metadata update completes
- call raid_end_bio_io() on bios in that queue when the time comes.
Signed-off-by: NeilBrown <neilb@suse.com>
md_setup_cluster already calls try_module_get(), so this
try_module_get isn't needed.
Also, there is no matching module_put (except in error patch),
so this leaves an unbalanced module count.
Signed-off-by: NeilBrown <neilb@suse.com>
This code looks racy.
The only possible race is if two modules try to register at the same
time and that won't happen. But make the code look safe anyway.
Signed-off-by: NeilBrown <neilb@suse.com>
In gather_all_resync_info, we need to read the disk bitmap sb and
check if it needs recovery.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Introduce MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure
complete(&cinfo->completion) is only be invoked when node
join cluster. Otherwise node failure could also call the
complete, and it doesn't make sense to do it.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
We also need to free the lock resource before goto out.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
The sb_lock is not used anywhere, so let's remove it.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
If the node just join the cluster, and receive the msg from other nodes
before init suspend_list, it will cause kernel crash due to NULL pointer
dereference, so move the initializations early to fix the bug.
md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3
BUG: unable to handle kernel NULL pointer dereference at (null)
... ... ...
Call Trace:
[<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster]
[<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster]
[<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod]
[<ffffffff810768c4>] kthread+0xb4/0xc0
[<ffffffff8151927c>] ret_from_fork+0x7c/0xb0
... ... ...
RIP [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster]
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
In complicated cluster environment, it is possible that the
dlm lock couldn't be get/convert on purpose, the related err
info is added for better debug potential issue.
For lockres_free, if the lock is blocking by a lock request or
conversion request, then dlm_unlock just put it back to grant
queue, so need to ensure the lock is free finally.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
We should init completion within lockres_init, otherwise
completion could be initialized more than one time during
it's life cycle.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
There is problem with previous communication mechanism, and we got below
deadlock scenario with cluster which has 3 nodes.
Sender Receiver Receiver
token(EX)
message(EX)
writes message
downconverts message(CR)
requests ack(EX)
get message(CR) gets message(CR)
reads message reads message
requests EX on message requests EX on message
To fix this problem, we do the following changes:
1. the sender downconverts MESSAGE to CW rather than CR.
2. and the receiver request PR lock not EX lock on message.
And in case we failed to down-convert EX to CW on message, it is better to
unlock message otherthan still hold the lock.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Lidong Zhong <ldzhong@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
When node A stops an array while the array is doing a resync, we need
to let another node B take over the resync task.
To achieve the goal, we need the A send an explicit BITMAP_NEEDS_SYNC
message to the cluster. And the node B which received that message will
invoke __recover_slot to do resync.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Make recover_slot as a wraper to __recover_slot, since the
logic of __recover_slot can be reused for the condition
when other nodes need to take over the resync job.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
We used to set up the safemode_timer timer in md_run. If md_run
would fail before the timer was set up we'd end up trying to modify
a timer that doesn't have a callback function when we access safe_delay_store,
which would trigger a BUG.
neilb: delete init_timer() call as setup_timer() does that.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
It is possible (though unlikely) for a reshape to be
interrupted between the time that end_reshape is called
and the time when raid5_finish_reshape is called.
This can leave conf->reshape_progress set to MaxSector,
but mddev->reshape_position not.
This combination confused reshape_request() when ->reshape_backwards.
As conf->reshape_progress is so high, it seems the reshape hasn't
really begun. But assuming MaxSector is a valid address only
leads to sorrow.
So ensure reshape_position and reshape_progress both agree,
and add an extra check in reshape_request() just in case they don't.
Signed-off-by: NeilBrown <neilb@suse.com>
There can be a small window between the moment that recovery
actually writes the last block and the time when various sysfs
and /proc/mdstat attributes report that it has finished.
During this time, 'sync_completed' can have the wrong value.
This can confuse monitoring software.
So:
- don't set curr_resync_completed beyond the end of the devices,
- set it correctly when resync/recovery has completed.
Signed-off-by: NeilBrown <neilb@suse.com>
While it generally shouldn't happen, it is not impossible for
curr_resync_completed to exceed resync_max.
This can particularly happen when reshaping RAID5 - the current
status isn't copied to curr_resync_completed promptly, so when it
is, it can exceed resync_max.
This happens when the reshape is 'frozen', resync_max is set low,
and reshape is re-enabled.
Taking a difference between two unsigned numbers is always dangerous
anyway, so add a test to behave correctly if
curr_resync_completed > resync_max
Signed-off-by: NeilBrown <neilb@suse.com>
This ensures that 'sync_action' will show 'recover' immediately the
array is started. If there is no spare the status will change to
'idle' once that is detected.
Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
happens.
This allows scripts which monitor status not to get confused -
particularly my test scripts.
Signed-off-by: NeilBrown <neilb@suse.com>
This code is calculating:
writepos, which is the furthest along address (device-space) that we
*will* be writing to
readpos, which is the earliest address that we *could* possible read
from, and
safepos, which is the earliest address in the 'old' section that we
might read from after a crash when the reshape position is
recovered from metadata.
The first is a precise calculation, so clipping at zero doesn't
make sense. As the reshape position is now guaranteed to always be
a multiple of reshape_sectors and as we already BUG_ON when
reshape_progress is zero, there is no point in this min_t() call.
The readpos and safepos are worst case - actual value depends on
precise geometry. That worst case could be negative, which is only
a problem because we are storing the value in an unsigned.
So leave the min_t() for those.
Signed-off-by: NeilBrown <neilb@suse.com>
When reshaping, we work in units of the largest chunk size.
If changing from a larger to a smaller chunk size, that means we
reshape more than one stripe at a time. So the required alignment
of reshape_position needs to take into account both the old
and new chunk size.
This means that both 'here_new' and 'here_old' are calculated with
respect to the same (maximum) chunk size, so testing if they are the
same when delta_disks is zero becomes pointless.
Signed-off-by: NeilBrown <neilb@suse.com>
The chunk_sectors and new_chunk_sectors fields of mddev can be changed
any time (via sysfs) that the reconfig mutex can be taken. So raid5
keeps internal copies in 'conf' which are stable except for a short
locked moment when reshape stops/starts.
So any access that does not hold reconfig_mutex should use the 'conf'
values, not the 'mddev' values.
Several don't.
This could result in corruption if new values were written at awkward
times.
Also use min() or max() rather than open-coding.
Signed-off-by: NeilBrown <neilb@suse.com>
These aren't really needed when no reshape is happening,
but it is safer to have them always set to a meaningful value.
The next patch will use ->prev_chunk_sectors without checking
if a reshape is happening (because that makes the code simpler),
and this patch makes that safe.
Signed-off-by: NeilBrown <neilb@suse.com>
md/raid5 only updates ->reshape_position (which is stored in
metadata and is authoritative) occasionally, but particularly
when getting closed to ->resync_max as it must be correct
when ->resync_max is reached.
When mdadm tries to stop an array which is reshaping it will:
- freeze the reshape,
- set resync_max to where the reshape has reached.
- unfreeze the reshape.
When this happens, the reshape is aborted and then restarted.
The restart doesn't check that resync_max is close, and so doesn't
update ->reshape_position like it should.
This results in the reshape stopping, but ->reshape_position being
incorrect.
So on that first call to reshape_request, make sure ->reshape_position
is updated if needed.
Signed-off-by: NeilBrown <neilb@suse.com>
When checking sync_action in a script, we want to be sure it is
as accurate as possible.
As resync/reshape etc doesn't always start immediately (a separate
thread is scheduled to do it), it is best if 'action_show'
checks if MD_RECOVER_NEEDED is set (which it does) and in that
case reports what is likely to start soon (which it only sometimes
does).
So:
- report 'reshape' if reshape_position suggests one might start.
- set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
likely to happen next.
Signed-off-by: NeilBrown <neilb@suse.com>
Currently when a recovery completes, mdstat shows that it has finished
before the new device is marked as a full member. Because of this it
can appear to a script that the recovery finished but the array isn't
in sync.
So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
Once md_reap_sync_thread() completes, the spare will be active and then
MD_RECOVERY_DONE will be cleared.
To ensure this is race-free, set MD_RECOVERY_DONE before clearning
curr_resync.
Signed-off-by: NeilBrown <neilb@suse.com>
Pull staging driver updates from Greg KH:
"Here is the big staging driver updates for 4.3-rc1.
Lots of things all over the place, almost all of them trivial fixups
and changes. The usual IIO updates and new drivers and we have added
the MOST driver subsystem which is getting cleaned up in the tree.
The ozwpan driver is finally being deleted as it is obviously
abandoned and no one cares about it.
Full details are in the shortlog, and all of these have been in
linux-next with no reported issues"
* tag 'staging-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (912 commits)
staging/lustre/o2iblnd: remove references to ib_reg_phsy_mr()
staging: wilc1000: fix build warning with setup_timer()
staging: wilc1000: remove DECLARE_WILC_BUFFER()
staging: wilc1000: remove void function return statements that are not useful
staging: wilc1000: coreconfigurator.c: fix kmalloc error check
staging: wilc1000: coreconfigurator.c: use kmalloc instead of WILC_MALLOC
staging: wilc1000: remove unused codes of gps8ConfigPacket
staging: wilc1000: remove unnecessary void pointer cast
staging: wilc1000: remove WILC_NEW and WILC_NEW_EX
staging: wilc1000: use kmalloc instead of WILC_NEW
staging: wilc1000: Process WARN, INFO options of debug levels from user
staging: wilc1000: remove unneeded tstrWILC_MsgQueueAttrs typedef
staging: wilc1000: delete wilc_osconfig.h
staging: wilc1000: delete wilc_log.h
staging: wilc1000: delete wilc_timer.h
staging: wilc1000: remove WILC_TimerStart()
staging: wilc1000: remove WILC_TimerCreate()
staging: wilc1000: remove WILC_TimerDestroy()
staging: wilc1000: remove WILC_TimerStop()
staging: wilc1000: remove tstrWILC_TimerAttrs typedef
...
Pull driver core updates from Greg KH:
"Here is the new patches for the driver core / sysfs for 4.3-rc1.
Very small number of changes here, all the details are in the
shortlog, nothing major happening at all this kernel release, which is
nice to see"
* tag 'driver-core-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
bus: subsys: update return type of ->remove_dev() to void
driver core: correct device's shutdown order
driver core: fix docbook for device_private.device
selftests: firmware: skip timeout checks for kernels without user mode helper
kernel, cpu: Remove bogus __ref annotations
cpu: Remove bogus __ref annotation of cpu_subsys_online()
firmware: fix wrong memory deallocation in fw_add_devm_name()
sysfs.txt: update show method notes about sprintf/snprintf/scnprintf usage
devres: fix devres_get()
From B spec, DDI_E port belong to PowerWell 2, but
DDI_E share the powerwell_req/staus register bit with
DDI_A which belong to DDI_A_E_POWER_WELL.
In order to communicate with the connector on DDI-E, both
DDI_A_E_POWER_WELL and POWER_WELL_2 must be enabled.
Currently intel_dp_power_get(DDI_E) only enable
DDI_A_E_POWER_WELL, this patch will not only enable
DDI_a_E_POWER_WELL but also enable POWER_WELL_2.
This patch also fix the DDI-E hotplug function.
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
AUX addresses are 20 bits long. Send out the entire address instead of
just the low 16 bits.
Port of:
drm/radeon/atom: Send out the full AUX address
to radeon non-atom aux path
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Pull char/misc driver patches from Greg KH:
"Here's the "big" char/misc driver update for 4.3-rc1.
Not much really interesting here, just a number of little changes all
over the place, and some nice consolidation of the nvmem drivers to a
common framework. As usual, the mei drivers stand out as the largest
"churn" to handle new devices and features in their hardware.
All have been in linux-next for a while with no issues"
* tag 'char-misc-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
auxdisplay: ks0108: initialize local parport variable
extcon: palmas: Fix build break due to devm_gpiod_get_optional API change
extcon: palmas: Support GPIO based USB ID detection
extcon: Fix signedness bugs about break error handling
extcon: Drop owner assignment from i2c_driver
extcon: arizona: Simplify pdata symantics for micd_dbtime
extcon: arizona: Declare 3-pole jack if we detect open circuit on mic
extcon: Add exception handling to prevent the NULL pointer access
extcon: arizona: Ensure variables are set for headphone detection
extcon: arizona: Use gpiod inteface to handle micd_pol_gpio gpio
extcon: arizona: Add basic microphone detection DT/ACPI bindings
extcon: arizona: Update to use the new device properties API
extcon: palmas: Remove the mutually_exclusive array
extcon: Remove optional print_state() function pointer of struct extcon_dev
extcon: Remove duplicate header file in extcon.h
extcon: max77843: Clear IRQ bits state before request IRQ
toshiba laptop: replace ioremap_cache with ioremap
misc: eeprom: max6875: clean up max6875_read()
misc: eeprom: clean up eeprom_read()
misc: eeprom: 93xx46: clean up eeprom_93xx46_bin_read/write
...
There are OEMs using DDI-E out there,
so let's enable it.
Unfortunately there is no detection bit for DDI-E
So we need to rely on VBT for that.
I also need to give credits to Xiong since before seing
his approach to check info->support_* I was creating an ugly
vbt->ddie_sfuse_strap in order to propagate the ddi presence info
v2: Rebased as last patch in the series. since all other patches
in this series are needed for anything working propperly on DDI-E.
Credits-to: "Zhang, Xiong Y" <xiong.y.zhang@intel.com>
Cc: "Zhang, Xiong Y" <xiong.y.zhang@intel.com>
Reviewed-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Pull kvm updates from Paolo Bonzini:
"A very small release for x86 and s390 KVM.
- s390: timekeeping changes, cleanups and fixes
- x86: support for Hyper-V MSRs to report crashes, and a bunch of
cleanups.
One interesting feature that was planned for 4.3 (emulating the local
APIC in kernel while keeping the IOAPIC and 8254 in userspace) had to
be delayed because Intel complained about my reading of the manual"
* tag 'kvm-4.3-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
x86/kvm: Rename VMX's segment access rights defines
KVM: x86/vPMU: Fix unnecessary signed extension for AMD PERFCTRn
kvm: x86: Fix error handling in the function kvm_lapic_sync_from_vapic
KVM: s390: Fix assumption that kvm_set_irq_routing is always run successfully
KVM: VMX: drop ept misconfig check
KVM: MMU: fully check zero bits for sptes
KVM: MMU: introduce is_shadow_zero_bits_set()
KVM: MMU: introduce the framework to check zero bits on sptes
KVM: MMU: split reset_rsvds_bits_mask_ept
KVM: MMU: split reset_rsvds_bits_mask
KVM: MMU: introduce rsvd_bits_validate
KVM: MMU: move FNAME(is_rsvd_bits_set) to mmu.c
KVM: MMU: fix validation of mmio page fault
KVM: MTRR: Use default type for non-MTRR-covered gfn before WARN_ON
KVM: s390: host STP toleration for VMs
KVM: x86: clean/fix memory barriers in irqchip_in_kernel
KVM: document memory barriers for kvm->vcpus/kvm->online_vcpus
KVM: x86: remove unnecessary memory barriers for shared MSRs
KVM: move code related to KVM_SET_BOOT_CPU_ID to x86
KVM: s390: log capability enablement and vm attribute changes
...
DDI-E doesn't have the correspondent GMBUS pin.
We rely on VBT to tell us which one it being used instead.
The DVI/HDMI on shared port couldn't exist.
This patch isn't tested without hardware wchich has HDMI
on DDI-E.
v2: fix trailing whitespace
v3: MISSING_CASE take place of BUG()
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
While the dest comm string size is assured to be at least TASK_COMM_LEN long,
doing a memcpy() also adds the assumption that the source is at least that
long as well, which isn't assured, and isn't true in cases such as:
set_task_comm(worker->task, "kworker/dying");
This leads to accessing invalid memory.
Link: http://lkml.kernel.org/r/1440760018-1557-1-git-send-email-sasha.levin@oracle.com
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ASoC: Updates for v4.3
Not many updates to the core here, but an awful lot of driver updates
this time round:
- Factoring out of AC'97 reset code into the core
- New drivers for Cirrus CS4349, GTM601, InvenSense ICS43432, Realtek
RT298 and ST STI controllers.
- Machine drivers for Rockchip systems with MAX98090 and RT5645 and
RT5650.
- Initial driver support for Intel Skylake devices.
- A large number of cleanups for Lars-Peter Clausen and Axel Lin.