Simplify queue configuration and MSI-X allocation logic. Use a single
MSI-X information table for both NIC and ULDs. Remove hard-coded
MSI-X indices for firmware event queue and non data interrupts.
Instead, use the MSI-X bitmap to obtain a free MSI-X index
dynamically. Save each Rxq's index into the MSI-X information table,
within the Rxq structures themselves, for easier cleanup.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
QoS offload needs Ethernet Offload (ETHOFLD) resources present in the
NIC. These resources are shared with other ULDs. So, query firmware
for the available number of traffic classes, as well as, start and
end indices (EOTID) of the ETHOFLD region.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
io_uring supports request types that basically have two different
lifetimes:
1) Bounded completion time. These are requests like disk reads or writes,
which we know will finish in a finite amount of time.
2) Unbounded completion time. These are generally networked IO, where we
have no idea how long they will take to complete. Another example is
POLL commands.
This patch provides support for io-wq to handle these differently, so we
don't starve bounded requests by tying up workers for too long. By default
all work is bounded, unless otherwise specified in the work item.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We hold the wqe lock at this point (which is also annotated), so there's
no need to use the careful variant of list_empty().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.
After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.
Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is in preparation for handling CQ ring overflow a bit smarter. We
should not have any functional changes in this patch. Most of the
changes are fairly straight forward, the only ones that stick out a bit
are the ones that change __io_free_req() to take the reference count
into account. If the request hasn't been submitted yet, we know it's
safe to simply ignore references and free it. But let's clean these up
too, as later patches will depend on the caller doing the right thing if
the completion logging grabs a reference to the request.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The rings can be derived from the ctx, and we need the ctx there for
a future change.
No functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.
This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specific can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're going to need this helper in a future patch, so move it out
of io_async_cancel() and into its own separate function.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch supports 2MB-aligned pinned file, which can guarantee no GC at all
by allocating fully valid 2MB segment.
Check free segments by has_not_enough_free_secs() with large budget.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If someone requests fscache on the mount, and the kernel doesn't
support it, it should fail the mount.
[ Drop ceph prefix -- it's provided by pr_err. ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Some firmware implementation gives error when a command is sent get mask
for core count 32-61. So use core count to decide.
But there is no function to get core count. So introduce one function to
get core count.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
There are some platforms, where there limited support of Intel(R) SST
features. Here perf-profile has only one base configuration and limited
support of commands. But still has support for discovery of base-freq and
turbo-freq features. So it is important to show minimum features to use
base-freq and turbo-freq features.
Here the change are:
- When there is no support of CONFIG_TDP_GET_LEVELS_INFO, then instead
of treating this as fatal error, treat this with number of config levels
= 0, that means only base level 0 is present.
- There is no support of mail box commands to get base frequencies or
turbo frequencies. Here present base frequency by reading cpufreq
base freq and turbo frequency by reading MSR 0x1AD.
- Don't display any field, which has value == 0.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Use different frequency weights for CLOS 0 and and CLOS1-3, to define
relative priority for power budgeting. This will be used for --auto
mode to enable base-freq and turbo-freq feature.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
There is an expectation in the CLX platform for SST base-freq feature that
Scaling min frequency be different for high and low priority cores.
This is the way the firmware will understand the priority.
So this change will look at high priority and low priority cores, and set
scaling_min_freq to P1High for high priority cores and P1Low to low
priority cores.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
In CLX_N base_frequency is read from cpufreq sysfs, where units are in
KHz. The internal units in the code matches the real ratios which are
in 100MHz scale. So when storing units for CLX-N frequencies, convert
to 100MHz scale.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Make the avx level display consistent. Except for "turbo-ratio-limits-avx",
everywhere else it is avx2. So change "turbo-ratio-limits-avx"
to "turbo-ratio-limits-avx2".
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Add support for uncore P0, uncore P1, P1 for base and AVX levels and
memory frequency. These commands are optional, so continue on
failure.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
When building with Clang + -Wtautological-constant-compare:
drivers/md/dm-raid.c:619:8: warning: converting the result of '<<' to a
boolean always evaluates to true [-Wtautological-constant-compare]
r = !RAID10_OFFSET;
^
drivers/md/dm-raid.c:517:28: note: expanded from macro 'RAID10_OFFSET'
#define RAID10_OFFSET (1 << 16) /* stripes with data
copies area adjacent on devices */
^
1 warning generated.
Negating a non-zero number will always make it zero, which is the
default value of r in this function so this statement is unnecessary;
remove it so that clang no longer warns.
Link: https://github.com/ClangBuiltLinux/linux/issues/753
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We can just call dma_free_contiguous directly instead of wrapping it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
It's not necessary to adjust the task state and revisit the state
of source and destination cgroups if the cgroups are not in freeze
state and the task itself is not frozen.
And in this scenario, it wakes up the task who's not supposed to be
ready to run.
Don't do the unnecessary task state adjustment can help stop waking
up the task without a reason.
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
On architectures where loff_t is wider than pgoff_t, the expression
((page->index + 1) << PAGE_SHIFT) can overflow. Rewrite to use the page
offset, which we already compute here anyway.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Andrii Nakryiko says:
====================
Github's mirror of libbpf got LGTM and Coverity statis analysis running
against it and spotted few real bugs and few potential issues. This patch
series fixes found issues.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>