If the 1st NONHEAD lcluster of a pcluster isn't CBLKCNT lcluster type
rather than a HEAD or PLAIN type instead, which means its pclustersize
_must_ be 1 lcluster (since its uncompressed size < 2 lclusters),
as illustrated below:
HEAD HEAD / PLAIN lcluster type
____________ ____________
|_:__________|_________:__| file data (uncompressed)
. .
.____________.
|____________| pcluster data (compressed)
Such on-disk case was explained before [1] but missed to be handled
properly in the runtime implementation.
It can be observed if manually generating 1 lcluster-sized pcluster
with 2 lclusters (thus CBLKCNT doesn't exist.) Let's fix it now.
[1] https://lore.kernel.org/r/20210407043927.10623-1-xiang@kernel.org
Link: https://lore.kernel.org/r/20210510064715.29123-1-xiang@kernel.org
Fixes: cec6e93beadf ("erofs: support parsing big pcluster compress indexes")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>
Bug: 201372112
Change-Id: I7e46baa993790f8908287ac36e19d32e536116db
(cherry picked from commit 0852b6ca941ef3ff75076e85738877bd3271e1cd)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Prior to big pcluster, there was only one compressed page so it'd
easy to map this. However, when big pcluster is enabled, more work
needs to be done to handle multiple compressed pages. In detail,
- (maptype 0) if there is only one compressed page + no need
to copy inplace I/O, just map it directly what we did before;
- (maptype 1) if there are more compressed pages + no need to
copy inplace I/O, vmap such compressed pages instead;
- (maptype 2) if inplace I/O needs to be copied, use per-CPU
buffers for decompression then.
Another thing is how to detect inplace decompression is feasable or
not (it's still quite easy for non big pclusters), apart from the
inplace margin calculation, inplace I/O page reusing order is also
needed to be considered for each compressed page. Currently, if the
compressed page is the xth page, it shouldn't be reused as [0 ...
nrpages_out - nrpages_in + x], otherwise a full copy will be triggered.
Although there are some extra optimization ideas for this, I'd like
to make big pcluster work correctly first and obviously it can be
further optimized later since it has nothing with the on-disk format
at all.
Link: https://lore.kernel.org/r/20210407043927.10623-10-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I8e8b90bd0a401850ea81d895dc55e5d8a2772ed7
(cherry picked from commit 598162d050801e556750defff4ddab499e5d76ed)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Different from non-compact indexes, several lclusters are packed
as the compact form at once and an unique base blkaddr is stored for
each pack, so each lcluster index would take less space on avarage
(e.g. 2 bytes for COMPACT_2B.) btw, that is also why BIG_PCLUSTER
switch should be consistent for compact head0/1.
Prior to big pcluster, the size of all pclusters was 1 lcluster.
Therefore, when a new HEAD lcluster was scanned, blkaddr would be
bumped by 1 lcluster. However, that way doesn't work anymore for
big pcluster since we actually don't know the compressed size of
pclusters in advance (before reading CBLKCNT lcluster).
So, instead, let blkaddr of each pack be the first pcluster blkaddr
with a valid CBLKCNT, in detail,
1) if CBLKCNT starts at the pack, this first valid pcluster is
itself, e.g.
_____________________________________________________________
|_CBLKCNT0_|_NONHEAD_| .. |_HEAD_|_CBLKCNT1_| ... |_HEAD_| ...
^ = blkaddr base ^ += CBLKCNT0 ^ += CBLKCNT1
2) if CBLKCNT doesn't start at the pack, the first valid pcluster
is the next pcluster, e.g.
_________________________________________________________
| NONHEAD_| .. |_HEAD_|_CBLKCNT0_| ... |_HEAD_|_HEAD_| ...
^ = blkaddr base ^ += CBLKCNT0
^ += 1
When a CBLKCNT is found, blkaddr will be increased by CBLKCNT
lclusters, or a new HEAD is found immediately, bump blkaddr by 1
instead (see the picture above.)
Also noted if CBLKCNT is the end of the pack, instead of storing
delta1 (distance of the next HEAD lcluster) as normal NONHEADs,
it still uses the compressed block count (delta0) since delta1
can be calculated indirectly but the block count can't.
Adjust decoding logic to fit big pcluster compact indexes as well.
Link: https://lore.kernel.org/r/20210407043927.10623-9-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I09d6bb4c9d390dc6169cb9dd4efbc7fd6b3be5d0
(cherry picked from commit b86269f43892316ef5a177d7180d09d101a46f22)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
When INCOMPAT_BIG_PCLUSTER sb feature is enabled, legacy compress indexes
will also have the same on-disk header compact indexes to keep per-file
configurations instead of leaving it zeroed.
If ADVISE_BIG_PCLUSTER is set for a file, CBLKCNT will be loaded for each
pcluster in this file by parsing 1st non-head lcluster.
Link: https://lore.kernel.org/r/20210407043927.10623-8-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I97b8d1cc54e057789d22f1cdfafc21ed6a69b149
(cherry picked from commit cec6e93beadfd145758af2c0854fcc2abb8170cb)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Big pcluster indicates the size of compressed data for each physical
pcluster is no longer fixed as block size, but could be more than 1
block (more accurately, 1 logical pcluster)
When big pcluster feature is enabled for head0/1, delta0 of the 1st
non-head lcluster index will keep block count of this pcluster in
lcluster size instead of 1. Or, the compressed size of pcluster
should be 1 lcluster if pcluster has no non-head lcluster index.
Also note that BIG_PCLUSTER feature reuses COMPR_CFGS feature since
it depends on COMPR_CFGS and will be released together.
Link: https://lore.kernel.org/r/20210407043927.10623-6-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: Ibd33f2764f4420f370f9293ed325efdba9ea70a7
(cherry picked from commit 5404c33010cb8ee063c05376d4a2eba129872281)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
When picking up inplace I/O pages, it should be traversed in reverse
order in aligned with the traversal order of file-backed online pages.
Also, index should be updated together when preloading compressed pages.
Previously, only page-sized pclustersize was supported so no problem
at all. Also rename `compressedpages' to `icpage_ptr' to reflect its
functionality.
Link: https://lore.kernel.org/r/20210407043927.10623-5-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I752769d004c18f74636bcfdea767572daa6f7072
(cherry picked from commit 81382f5f5cb0c9c5694c19d36460f757a8c96841)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Since multiple pcluster sizes could be used at once, the number of
compressed pages will become a variable factor. It's necessary to
introduce slab pools rather than a single slab cache now.
This limits the pclustersize to 1M (Z_EROFS_PCLUSTER_MAX_SIZE), and
get rid of the obsolete EROFS_FS_CLUSTER_PAGE_LIMIT, which has no
use now.
Link: https://lore.kernel.org/r/20210407043927.10623-4-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I821f3cf35820dbb320eeedc3ae934fd1d455dfd7
(cherry picked from commit 9f6cc76e6ff0631a99cd94eab8af137057633a52)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
To deal the with the cases which inplace decompression is infeasible
for some inplace I/O. Per-CPU buffers was introduced to get rid of page
allocation latency and thrash for low-latency decompression algorithms
such as lz4.
For the big pcluster feature, introduce multipage per-CPU buffers to
keep such inplace I/O pclusters temporarily as well but note that
per-CPU pages are just consecutive virtually.
When a new big pcluster fs is mounted, its max pclustersize will be
read and per-CPU buffers can be growed if needed. Shrinking adjustable
per-CPU buffers is more complex (because we don't know if such size
is still be used), so currently just release them all when unloading.
Link: https://lore.kernel.org/r/20210409190630.19569-1-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I5b3ab69ef58aaea635911b0c08cceeb4b38122da
(cherry picked from commit 524887347fcb67faa0a63dd3c4c02ab48d4968d4)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Formal big pcluster design is actually more powerful / flexable than
the previous thought whose pclustersize was fixed as power-of-2 blocks,
which was obviously inefficient and space-wasting. Instead, pclustersize
can now be set independently for each pcluster, so various pcluster
sizes can also be used together in one file if mkfs wants (for example,
according to data type and/or compression ratio).
Let's get rid of previous physical_clusterbits[] setting (also notice
that corresponding on-disk fields are still 0 for now). Therefore,
head1/2 can be used for at most 2 different algorithms in one file and
again pclustersize is now independent of these.
Link: https://lore.kernel.org/r/20210407043927.10623-2-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I7db236f06b202949b882a366b347f0c4e9dc0c3e
(cherry picked from commit 54e0b6c873dcbd02b9b479c893f6fba8fcbc6a9c)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Add a bitmap for available compression algorithms and a variable-sized
on-disk table for compression options in preparation for upcoming big
pcluster and LZMA algorithm, which follows the end of super block.
To parse the compression options, the bitmap is scanned one by one.
For each available algorithm, there is data followed by 2-byte `length'
correspondingly (it's enough for most cases, or entire fs blocks should
be used.)
With such available algorithm bitmap, kernel itself can also refuse to
mount such filesystem if any unsupported compression algorithm exists.
Note that COMPR_CFGS feature will be enabled with BIG_PCLUSTER.
Link: https://lore.kernel.org/r/20210329100012.12980-1-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I9f9c69d5cd05da8c5f17e3650511e83c5f89200e
(cherry picked from commit 14373711dd54be8a84e2f4f624bc58787f80cfbd)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Currently, erofs_map_blocks() will be called only from
erofs_{bmap, read_raw_page} which are all for uncompressed files.
So, the compression branch in erofs_map_blocks() is pointless. Let's
remove it and use erofs_map_blocks_flatmode() directly. Also update
related comments.
Link: https://lore.kernel.org/r/20210325071008.573-1-zbestahu@gmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: Ic3362c62d7c917f05f1cd11120ee6280a22221b8
(cherry picked from commit 8137824eddd2e790c61c70c20d70a087faca95fa)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Add a missing case which could cause unnecessary page allocation but
not directly use inplace I/O instead, which increases runtime extra
memory footprint.
The detail is, considering an online file-backed page, the right half
of the page is chosen to be cached (e.g. the end page of a readahead
request) and some of its data doesn't exist in managed cache, so the
pcluster will be definitely kept in the submission chain. (IOWs, it
cannot be decompressed without I/O, e.g., due to the bypass queue).
Currently, DELAYEDALLOC/TRYALLOC cases can be downgraded as NOINPLACE,
and stop online pages from inplace I/O. After this patch, unneeded page
allocations won't be observed in pickup_page_for_submission() then.
Link: https://lore.kernel.org/r/20210321183227.5182-1-hsiangkao@aol.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Bug: 201372112
Change-Id: I639279e87b749b8625d2a8e77055a9fc52073542
(cherry picked from commit 0b964600d3aae56ff9d5bdd710a79f39a44c572c)
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
kernel test robot reported the regression of fio.write_iops[1] with [2].
Since lru_add_drain is called frequently, invalidate bh_lrus there could
increase bh_lrus cache miss ratio, which needs more IO in the end.
This patch moves the bh_lrus invalidation from the hot path( e.g.,
zap_page_range, pagevec_release) to cold path(i.e., lru_add_drain_all,
lru_cache_disable).
"Xing, Zhengjun" confirmed
: I test the patch, the regression reduced to -2.9%.
[1] https://lore.kernel.org/lkml/20210520083144.GD14190@xsang-OptiPlex-9020/
[2] 8cc621d2f45d, mm: fs: invalidate BH LRU during page migration
Bug: 194673488
Link: https://lkml.kernel.org/r/20210907212347.1977686-1-minchan@kernel.org
(cherry picked from commit 243418e3925d5b5b0657ae54c322d43035e97eed)
[Chris: resolved conflicts due to Minchan's AOSP LRU commits]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Tested-by: "Xing, Zhengjun" <zhengjun.xing@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Change-Id: Icc5e456b058df516480b4378853464d6d7b43505
When FUSE passthrough is used, the lower file system file is manipulated
directly, but neither mtime, atime or ctime of the referencing FUSE file
is updated.
Fix by updating the file times when passthrough operations are
performed.
Bug: 200779468
Reported-by: Fengnan Chang <changfengnan@vivo.com>
Reported-by: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Alessio Balsini <balsini@google.com>
Change-Id: I35b72196b2cc1d79a9f62ddb32e2cfa934c3b6d3
The page used to contain the fuse_dentry_canonical_path to be handled in
fuse_dev_do_write is allocated using __get_free_pages(GFP_KERNEL).
The returned page may contain undefined data, that by chance may be
considered as a valid path name that is not in the cache. In that case,
if the FUSE daemon mistakenly doesn't fill the canonical path buffer,
the FUSE driver may fall into two blocking
request_wait_answer(fuse_dev_write->kern_path->fuse_lookup_name)
causing a deadlock condition.
The stack is as follows:
find S 0 20511 20117 0x00000000
Call trace:
[<ffffff8008085e78>] __switch_to+0xb8/0xd4
[<ffffff8008a0cac4>] __schedule+0x458/0x714
[<ffffff8008a0ce0c>] schedule+0x8c/0xa8
[<ffffff800833865c>] request_wait_answer+0x74/0x220
[<ffffff8008339f70>] __fuse_request_send+0x8c/0xa0
[<ffffff8008339fe4>] fuse_request_send+0x60/0x6c
[<ffffff800833c1a8>] fuse_dentry_canonical_path+0xb8/0x104
[<ffffff800820b14c>] do_sys_open+0x1b4/0x260
[<ffffff800820b27c>] SyS_openat+0x3c/0x4c
[<ffffff8008083540>] el0_svc_naked+0x34/0x38
mount.ntfs-3g S 0 5845 1 0x00000000
Call trace:
[<ffffff8008085e78>] __switch_to+0xb8/0xd4
[<ffffff8008a0cac4>] __schedule+0x458/0x714
[<ffffff8008a0ce0c>] schedule+0x8c/0xa8
[<ffffff800833865c>] request_wait_answer+0x74/0x220
[<ffffff8008339f70>] __fuse_request_send+0x8c/0xa0
[<ffffff8008339fe4>] fuse_request_send+0x60/0x6c
[<ffffff800833bdb0>] fuse_simple_request+0x128/0x16c
[<ffffff800833dddc>] fuse_lookup_name+0x104/0x1b0
[<ffffff800833dee4>] fuse_lookup+0x5c/0x11c
[<ffffff800821861c>] lookup_slow+0xfc/0x174
[<ffffff800821b474>] walk_component+0xf0/0x290
[<ffffff800821bbac>] path_lookupat+0xa0/0x128
[<ffffff800821c7f4>] filename_lookup+0x84/0x124
[<ffffff800821c8d8>] kern_path+0x44/0x54
[<ffffff800833b0c8>] fuse_dev_do_write+0x828/0xa0c
[<ffffff800833b610>] fuse_dev_write+0x90/0xb4
[<ffffff800820b770>] do_iter_readv_writev+0xf4/0x13c
[<ffffff800820cc88>] do_readv_writev+0xec/0x220
[<ffffff800820d05c>] vfs_writev+0x60/0x74
[<ffffff800820d0ec>] do_writev+0x7c/0x100
[<ffffff800820e348>] SyS_writev+0x38/0x48
[<ffffff8008083540>] el0_svc_naked+0x34/0x38
Fix by ensuring that the page allocated for the canonical path is zeroed.
Bug: 194856119
Bug: 196051870
Fixes: 24ab59f6bb42 ("ANDROID: fuse: Add support for d_canonical_path")
Signed-off-by: Biao Li <libiao@allwinnertech.com>
Signed-off-by: Shuosheng Huang <huangshuosheng@allwinnertech.com>
Signed-off-by: Alessio Balsini <balsini@google.com>
Change-Id: I400815dc1049d90c308f5cf87ce60de97ff82131
We must flush all the dirty data when enabling checkpoint back. Let's guarantee
that first by adding a retry logic on sync_inodes_sb(). In addition to that,
this patch adds to flush data in fsync when checkpoint is disabled, which can
mitigate the sync_inodes_sb() failures in advance.
Bug: 194449609
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit dddd3d65293a52c2c3850c19b1e5115712e534d8)
Change-Id: I5bbef7386ddbb44fd925262fb68a8ef0a4960993
Since commit 1b6b26ae70 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.
In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already. Doing
extraneous wakeups will only cause potential thundering herd problems.
However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it". Even if there was no edge in sight.
Quoting Sandeep Patil:
"The commit 1b6b26ae70 ('pipe: fix and clarify pipe write wakeup
logic') changed pipe write logic to wakeup readers only if the pipe
was empty at the time of write. However, there are libraries that
relied upon the older behavior for notification scheme similar to
what's described in [1]
One such library 'realm-core'[2] is used by numerous Android
applications. The library uses a similar notification mechanism as GNU
Make but it never drains the pipe until it is full. When Android moved
to v5.10 kernel, all applications using this library stopped working.
The library has since been fixed[3] but it will be a while before all
applications incorporate the updated library"
Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong. Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.
So add the extraneous wakeup, to approximate the old behavior.
[ I say "approximate", because the exact old behavior was to do a wakeup
not for each write(), but for each pipe buffer chunk that was filled
in. The behavior introduced by this change is not that - this is just
"every write will cause a wakeup, whether necessary or not", which
seems to be sufficient for the broken library use. ]
It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.
See commit f467a6a664 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".
Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Bug: 193851993
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3a34b13a88caeb2800ab44a4918f230041b37dd9)
Signed-off-by: Sandeep Patil <sspatil@android.com>
Change-Id: Idcf3e8faa31bff47ada4b815237a355e0757b964
This tries to fix priority inversion in the below condition resulting in
long checkpoint delay.
f2fs_get_node_info()
- nat_tree_lock
-> sleep to grab journal_rwsem by contention
checkpoint
- waiting for nat_tree_lock
In order to let checkpoint go, let's release nat_tree_lock, if there's a
journal_rwsem contention.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Bug: 191987855
(cherry picked from commit 2eeb0dce728a7eac3e4dfe355d98af40d61f7a26
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: I97ac4f9d3bde399ab4f17f5b3a6e949ae9b79f0f
Signed-off-by: Daeho Jeong <daehojeong@google.com>
commit '1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic")'
change `pipe_write()` wakeup logic to wakeup readers only if the pipe
was empty.
This meant that applications that are not draining the pipe
before each write were exposed to unexpected timeouts / hangs in
epoll_wait() waiting for data in a pipe using EPOLLIN | EPOLLET flags.
This behaviour can be easily tested with android12-5.4 kernel where
the test that uses pipes for notifications in this way works while it
fails 100% with android12-5.10.
This change restores the old behavior to wakeup all pipe_readers if any
new data is written to the pipe.
Bug: 193851993
Bug: 193846582
Change-Id: If0c5a844091ccf16d5236bd072326325d4d5447a
Signed-off-by: Sandeep Patil <sspatil@google.com>
SBI_NEED_FSCK is an indicator that fsck.f2fs needs to be triggered, so it
is not fully critical to stop any IO writes. So, let's allow to write data
instead of reporting EIO forever given SBI_NEED_FSCK, but do keep OPU.
Bug: 193659742
Fixes: 955772787667 ("f2fs: drop inplace IO if fs status is abnormal")
Cc: <stable@kernel.org> # v5.13+
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit 1ffc8f5f7751f91fe6af527d426a723231b741a6
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: I9585358c1cee864064d9d210ee643f49ff1e6749
This effectively locks down OWNERS approval to a small group to guard
the code base against unintentional breakages.
Bug: 194314089
Signed-off-by: Matthias Maennich <maennich@google.com>
Change-Id: Ifd1ea97639a622320ea83f901f6451e2e52b38d4
Added gc_reclaimed_segments and gc_segment_mode sysfs nodes.
1) "gc_reclaimed_segments" shows how many segments have been
reclaimed by GC during a specific GC mode.
2) "gc_segment_mode" is used to control for which gc mode
the "gc_reclaimed_segments" node shows.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Bug: 182708936
(cherry picked from commit 07c6b5933ebf58b6132aea9f3e72a62486882bfb
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: Ie8c2ccf5d36ce2f388c98b77e22848f3ff6645c3
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Some of the platforms might be still expecting dedicated memory region
for ramoops node. So add logic to detect the start and size of the
ramoops memory region by looking up reserved memory region with
of_reserved_mem_lookup() when platform_get_resource() failed.
Bug: 191636717
Change-Id: Idc479b45fb3f637f7235efd6eabac62059d5e92b
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers. In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting. Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.
Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both are only readable by the process owner, as
follows:
1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.
Accessing other processes' fdinfo requires root privileges. This limits
the use of the interface to debugging environments and is not suitable for
production builds. Granting root privileges even to a system process
increases the attack surface and is highly undesirable.
Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.
Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Jann Horn <jannh@google.com>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7bc3fa0172a423afb34e6df7a3998e5f23b1a94a)
Bug: 159126739
Bug: 167141117
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I842b689670f731138592f45c7124ef446d9aa59a
Through this vendor hook, we can get the timing to check
current running task for the validation of its credential
and open operation.
Bug: 191291287
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Ia644ceb02dbc230ee1d25cad3630c2c3f908e41a
The reserved memory region for ramoops is assumed to be at a fixed
and known location when read from the devicetree. This is not desirable
in environments where it is preferred for the region to be dynamically
allocated at runtime, as opposed to it being fixed at compile time.
Change the logic for detecting the start and size of the ramoops
memory region by looking up the reserved memory region instead of
using platform_get_resource(), which assumes that the location
of the memory is known ahead of time.
Bug: 191636717
Link: https://lore.kernel.org/patchwork/patch/1451704/
Change-Id: I24066de9f4fe1f1575cb1bbb1687c37a2b1938a4
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
* aosp/upstream-f2fs-stable-linux-5.10.y:
Revert "f2fs: avoid attaching SB_ACTIVE flag during mount/remount"
f2fs: remove false alarm on iget failure during GC
f2fs: enable extent cache for compression files in read-only
f2fs: fix to avoid adding tab before doc section
f2fs: introduce f2fs_casefolded_name slab cache
f2fs: swap: support migrating swapfile in aligned write mode
f2fs: swap: remove dead codes
f2fs: compress: add compress_inode to cache compressed blocks
f2fs: clean up /sys/fs/f2fs/<disk>/features
f2fs: add pin_file in feature list
f2fs: Advertise encrypted casefolding in sysfs
f2fs: Show casefolding support only when supported
f2fs: support RO feature
f2fs: logging neatening
Bug: 186107892
Bug: 190759634
Bug: 190517210
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Change-Id: I8eb93d8a43304b98166676da52a9c2434b15b942
This patch removes setting SBI_NEED_FSCK when GC gets an error on f2fs_iget,
since f2fs_iget can give ENOMEM and others by race condition.
If we set this critical fsck flag, we'll get EIO during fsync via the below
code path.
In f2fs_inplace_write_data(),
if (is_sbi_flag_set(sbi, SBI_NEED_FSCK) || f2fs_cp_error(sbi)) {
err = -EIO;
goto drop_bio;
}
Fixes: 9557727876674 ("f2fs: drop inplace IO if fs status is abnormal")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a slab cache: "f2fs_casefolded_name" for memory allocation
of casefold name.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Changes in 5.10.43
btrfs: tree-checker: do not error out if extent ref hash doesn't match
net: usb: cdc_ncm: don't spew notifications
hwmon: (dell-smm-hwmon) Fix index values
hwmon: (pmbus/isl68137) remove READ_TEMPERATURE_3 for RAA228228
netfilter: conntrack: unregister ipv4 sockopts on error unwind
efi/fdt: fix panic when no valid fdt found
efi: Allow EFI_MEMORY_XP and EFI_MEMORY_RO both to be cleared
efi/libstub: prevent read overflow in find_file_option()
efi: cper: fix snprintf() use in cper_dimm_err_location()
vfio/pci: Fix error return code in vfio_ecap_init()
vfio/pci: zap_vma_ptes() needs MMU
samples: vfio-mdev: fix error handing in mdpy_fb_probe()
vfio/platform: fix module_put call in error flow
ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service
HID: logitech-hidpp: initialize level variable
HID: pidff: fix error return code in hid_pidff_init()
HID: i2c-hid: fix format string mismatch
devlink: Correct VIRTUAL port to not have phys_port attributes
net/sched: act_ct: Offload connections with commit action
net/sched: act_ct: Fix ct template allocation for zone 0
mptcp: always parse mptcp options for MPC reqsk
nvme-rdma: fix in-casule data send for chained sgls
ACPICA: Clean up context mutex during object deletion
perf probe: Fix NULL pointer dereference in convert_variable_location()
net: dsa: tag_8021q: fix the VLAN IDs used for encoding sub-VLANs
net: sock: fix in-kernel mark setting
net/tls: Replace TLS_RX_SYNC_RUNNING with RCU
net/tls: Fix use-after-free after the TLS device goes down and up
net/mlx5e: Fix incompatible casting
net/mlx5: Check firmware sync reset requested is set before trying to abort it
net/mlx5e: Check for needed capability for cvlan matching
net/mlx5: DR, Create multi-destination flow table with level less than 64
nvmet: fix freeing unallocated p2pmem
netfilter: nft_ct: skip expectations for confirmed conntrack
netfilter: nfnetlink_cthelper: hit EBUSY on updates if size mismatches
drm/i915/selftests: Fix return value check in live_breadcrumbs_smoketest()
bpf: Simplify cases in bpf_base_func_proto
bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
ieee802154: fix error return code in ieee802154_add_iface()
ieee802154: fix error return code in ieee802154_llsec_getparams()
igb: add correct exception tracing for XDP
ixgbevf: add correct exception tracing for XDP
cxgb4: fix regression with HASH tc prio value update
ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions
ice: Fix allowing VF to request more/less queues via virtchnl
ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared
ice: handle the VF VSI rebuild failure
ice: report supported and advertised autoneg using PHY capabilities
ice: Allow all LLDP packets from PF to Tx
i2c: qcom-geni: Add shutdown callback for i2c
cxgb4: avoid link re-train during TC-MQPRIO configuration
i40e: optimize for XDP_REDIRECT in xsk path
i40e: add correct exception tracing for XDP
ice: simplify ice_run_xdp
ice: optimize for XDP_REDIRECT in xsk path
ice: add correct exception tracing for XDP
ixgbe: optimize for XDP_REDIRECT in xsk path
ixgbe: add correct exception tracing for XDP
arm64: dts: ti: j7200-main: Mark Main NAVSS as dma-coherent
optee: use export_uuid() to copy client UUID
bus: ti-sysc: Fix am335x resume hang for usb otg module
arm64: dts: ls1028a: fix memory node
arm64: dts: zii-ultra: fix 12V_MAIN voltage
arm64: dts: freescale: sl28: var4: fix RGMII clock and voltage
ARM: dts: imx7d-meerkat96: Fix the 'tuning-step' property
ARM: dts: imx7d-pico: Fix the 'tuning-step' property
ARM: dts: imx: emcon-avari: Fix nxp,pca8574 #gpio-cells
bus: ti-sysc: Fix flakey idling of uarts and stop using swsup_sidle_act
tipc: add extack messages for bearer/media failure
tipc: fix unique bearer names sanity check
serial: stm32: fix threaded interrupt handling
riscv: vdso: fix and clean-up Makefile
io_uring: fix link timeout refs
io_uring: use better types for cflags
drm/amdgpu/vcn3: add cancel_delayed_work_sync before power gate
drm/amdgpu/jpeg2.5: add cancel_delayed_work_sync before power gate
drm/amdgpu/jpeg3: add cancel_delayed_work_sync before power gate
Bluetooth: fix the erroneous flush_work() order
Bluetooth: use correct lock to prevent UAF of hdev object
wireguard: do not use -O3
wireguard: peer: allocate in kmem_cache
wireguard: use synchronize_net rather than synchronize_rcu
wireguard: selftests: remove old conntrack kconfig value
wireguard: selftests: make sure rp_filter is disabled on vethc
wireguard: allowedips: initialize list head in selftest
wireguard: allowedips: remove nodes in O(1)
wireguard: allowedips: allocate nodes in kmem_cache
wireguard: allowedips: free empty intermediate nodes when removing single node
net: caif: added cfserl_release function
net: caif: add proper error handling
net: caif: fix memory leak in caif_device_notify
net: caif: fix memory leak in cfusbl_device_notify
HID: i2c-hid: Skip ELAN power-on command after reset
HID: magicmouse: fix NULL-deref on disconnect
HID: multitouch: require Finger field to mark Win8 reports as MT
gfs2: fix scheduling while atomic bug in glocks
ALSA: timer: Fix master timer notification
ALSA: hda: Fix for mute key LED for HP Pavilion 15-CK0xx
ALSA: hda: update the power_state during the direct-complete
ARM: dts: imx6dl-yapp4: Fix RGMII connection to QCA8334 switch
ARM: dts: imx6q-dhcom: Add PU,VDD1P1,VDD2P5 regulators
ext4: fix memory leak in ext4_fill_super
ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
ext4: fix fast commit alignment issues
ext4: fix memory leak in ext4_mb_init_backend on error path.
ext4: fix accessing uninit percpu counter variable with fast_commit
usb: dwc2: Fix build in periphal-only mode
pid: take a reference when initializing `cad_pid`
ocfs2: fix data corruption by fallocate
mm/debug_vm_pgtable: fix alignment for pmd/pud_advanced_tests()
mm/page_alloc: fix counting of free pages after take off from buddy
x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
x86/sev: Check SME/SEV support in CPUID first
nfc: fix NULL ptr dereference in llcp_sock_getname() after failed connect
drm/amdgpu: Don't query CE and UE errors
drm/amdgpu: make sure we unpin the UVD BO
x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
powerpc/kprobes: Fix validation of prefixed instructions across page boundary
btrfs: mark ordered extent and inode with error if we fail to finish
btrfs: fix error handling in btrfs_del_csums
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
btrfs: fixup error handling in fixup_inode_link_counts
btrfs: abort in rename_exchange if we fail to insert the second ref
btrfs: fix deadlock when cloning inline extents and low on available space
mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY
drm/msm/dpu: always use mdp device to scale bandwidth
btrfs: fix unmountable seed device after fstrim
KVM: SVM: Truncate GPR value for DR and CR accesses in !64-bit mode
KVM: arm64: Fix debug register indexing
x86/kvm: Teardown PV features on boot CPU as well
x86/kvm: Disable kvmclock on all CPUs on shutdown
x86/kvm: Disable all PV features on crash
lib/lz4: explicitly support in-place decompression
i2c: qcom-geni: Suspend and resume the bus during SYSTEM_SLEEP_PM ops
netfilter: nf_tables: missing error reporting for not selected expressions
xen-netback: take a reference to the RX task thread
neighbour: allow NUD_NOARP entries to be forced GCed
Linux 5.10.43
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8d7ec0878193e4e454076809b7fb71fcc4e3d810
commit 5e753a817b2d5991dfe8a801b7b1e8e79a1c5a20 upstream.
The following test case reproduces an issue of wrongly freeing in-use
blocks on the readonly seed device when fstrim is called on the rw sprout
device. As shown below.
Create a seed device and add a sprout device to it:
$ mkfs.btrfs -fq -dsingle -msingle /dev/loop0
$ btrfstune -S 1 /dev/loop0
$ mount /dev/loop0 /btrfs
$ btrfs dev add -f /dev/loop1 /btrfs
BTRFS info (device loop0): relocating block group 290455552 flags system
BTRFS info (device loop0): relocating block group 1048576 flags system
BTRFS info (device loop0): disk added /dev/loop1
$ umount /btrfs
Mount the sprout device and run fstrim:
$ mount /dev/loop1 /btrfs
$ fstrim /btrfs
$ umount /btrfs
Now try to mount the seed device, and it fails:
$ mount /dev/loop0 /btrfs
mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
Block 5292032 is missing on the readonly seed device:
$ dmesg -kt | tail
<snip>
BTRFS error (device loop0): bad tree block start, want 5292032 have 0
BTRFS warning (device loop0): couldn't read-tree root
BTRFS error (device loop0): open_ctree failed
>From the dump-tree of the seed device (taken before the fstrim). Block
5292032 belonged to the block group starting at 5242880:
$ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP
<snip>
item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24
block group used 114688 chunk_objectid 256 flags METADATA
<snip>
>From the dump-tree of the sprout device (taken before the fstrim).
fstrim used block-group 5242880 to find the related free space to free:
$ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP
<snip>
item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24
block group used 32768 chunk_objectid 256 flags METADATA
<snip>
BPF kernel tracing the fstrim command finds the missing block 5292032
within the range of the discarded blocks as below:
kprobe:btrfs_discard_extent {
printf("freeing start %llu end %llu num_bytes %llu:\n",
arg1, arg1+arg2, arg2);
}
freeing start 5259264 end 5406720 num_bytes 147456
<snip>
Fix this by avoiding the discard command to the readonly seed device.
Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 76a6d5cd74479e7ec8a7f9a29bce63d5549b6b2e upstream.
There are a few cases where cloning an inline extent requires copying data
into a page of the destination inode. For these cases we are allocating
the required data and metadata space while holding a leaf locked. This can
result in a deadlock when we are low on available space because allocating
the space may flush delalloc and two deadlock scenarios can happen:
1) When starting writeback for an inode with a very small dirty range that
fits in an inline extent, we deadlock during the writeback when trying
to insert the inline extent, at cow_file_range_inline(), if the extent
is going to be located in the leaf for which we are already holding a
read lock;
2) After successfully starting writeback, for non-inline extent cases,
the async reclaim thread will hang waiting for an ordered extent to
complete if the ordered extent completion needs to modify the leaf
for which the clone task is holding a read lock (for adding or
replacing file extent items). So the cloning task will wait forever
on the async reclaim thread to make progress, which in turn is
waiting for the ordered extent completion which in turn is waiting
to acquire a write lock on the same leaf.
So fix this by making sure we release the path (and therefore the leaf)
every time we need to copy the inline extent's data into a page of the
destination inode, as by that time we do not need to have the leaf locked.
Fixes: 05a5a7621c ("Btrfs: implement full reflink support for inline extents")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc09ef3562726cd520c8338c1640872a60187af5 upstream.
Error injection stress uncovered a problem where we'd leave a dangling
inode ref if we failed during a rename_exchange. This happens because
we insert the inode ref for one side of the rename, and then for the
other side. If this second inode ref insert fails we'll leave the first
one dangling and leave a corrupt file system behind. Fix this by
aborting if we did the insert for the first inode ref.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 011b28acf940eb61c000059dd9e2cfcbf52ed96b upstream.
This function has the following pattern
while (1) {
ret = whatever();
if (ret)
goto out;
}
ret = 0
out:
return ret;
However several places in this while loop we simply break; when there's
a problem, thus clearing the return value, and in one case we do a
return -EIO, and leak the memory for the path.
Fix this by re-arranging the loop to deal with ret == 1 coming from
btrfs_search_slot, and then simply delete the
ret = 0;
out:
bit so everybody can break if there is an error, which will allow for
proper error handling to occur.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>