Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
50
Documentation/filesystems/00-INDEX
Normal file
@@ -0,0 +1,50 @@
00-INDEX
        - this file (info on some of the filesystems supported by linux).
Locking
        - info on locking rules as they pertain to Linux VFS.
adfs.txt
        - info and mount options for the Acorn Advanced Disc Filing System.
affs.txt
        - info and mount options for the Amiga Fast File System.
bfs.txt
        - info for the SCO UnixWare Boot Filesystem (BFS).
cifs.txt
        - description of the CIFS filesystem
coda.txt
        - description of the CODA filesystem.
cramfs.txt
        - info on the cram filesystem for small storage (ROMs etc)
devfs/
        - directory containing devfs documentation.
ext2.txt
        - info, mount options and specifications for the Ext2 filesystem.
fat_cvf.txt
        - info on the Compressed Volume Files extension to the FAT filesystem
hpfs.txt
        - info and mount options for the OS/2 HPFS.
isofs.txt
        - info and mount options for the ISO 9660 (CDROM) filesystem.
jfs.txt
        - info and mount options for the JFS filesystem.
ncpfs.txt
        - info on Novell Netware(tm) filesystem using NCP protocol.
ntfs.txt
        - info and mount options for the NTFS filesystem (Windows NT).
proc.txt
        - info on Linux's /proc filesystem.
romfs.txt
        - Description of the ROMFS filesystem.
smbfs.txt
        - info on using filesystems with the SMB protocol (Windows 3.11 and NT)
sysv-fs.txt
        - info on the SystemV/V7/Xenix/Coherent filesystem.
udf.txt
        - info and mount options for the UDF filesystem.
ufs.txt
        - info on the ufs filesystem.
vfat.txt
        - info on using the VFAT filesystem used in Windows NT and Windows 95
vfs.txt
        - Overview of the Virtual File System
xfs.txt
        - info and mount options for the XFS filesystem.
176
Documentation/filesystems/Exporting
Normal file
@@ -0,0 +1,176 @@

Making Filesystems Exportable
=============================

Most filesystem operations require a dentry (or two) as a starting
point.  Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root.  However, remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry.  As the alternative
form of reference needs to be stable across renames, truncates, and
server-reboot (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.

The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.

A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".



Dcache Issues
-------------

The dcache normally contains a proper prefix of any given filesystem
tree.  This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).

However, when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object.  This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.

 1/ The dcache must sometimes contain objects that are not part of the
    proper prefix, i.e. objects that are not connected to the root.
 2/ The dcache must be prepared for a newly found (via ->lookup) directory
    to already have a (non-connected) dentry, and must be able to move
    that dentry into place (based on the parent and name in the
    ->lookup).  This is particularly needed for directories as
    it is a dcache invariant that directories only have one dentry.

To implement these features, the dcache has:

 a/ A dentry flag DCACHE_DISCONNECTED which is set on
    any dentry that might not be part of the proper prefix.
    This is set when anonymous dentries are created, and cleared when a
    dentry is noticed to be a child of a dentry which is in the proper
    prefix.

 b/ A per-superblock list "s_anon" of dentries which are the roots of
    subtrees that are not in the proper prefix.  These dentries, as
    well as the proper prefix, need to be released at unmount time.  As
    these dentries will not be hashed, they are linked together on the
    d_hash list_head.

 c/ Helper routines to allocate anonymous dentries, and to help attach
    loose directory dentries at lookup time.  They are:
      d_alloc_anon(inode) will return a dentry for the given inode.
        If the inode already has a dentry, one of those is returned.
        If it doesn't, a new anonymous (IS_ROOT and
          DCACHE_DISCONNECTED) dentry is allocated and attached.
        In the case of a directory, care is taken that only one dentry
        can ever be attached.
      d_splice_alias(inode, dentry) will make sure that there is a
        dentry with the same name and parent as the given dentry, and
        which refers to the given inode.
        If the inode is a directory and already has a dentry, then that
        dentry is d_moved over the given dentry.
        If the passed dentry gets attached, care is taken that this is
        mutually exclusive to a d_alloc_anon operation.
        If the passed dentry is used, NULL is returned, else the used
        dentry is returned.  This corresponds to the calling pattern of
        ->lookup.


Filesystem Issues
-----------------

For a filesystem to be exportable it must:

   1/ provide the filehandle fragment routines described below.
   2/ make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.
      Typically the ->lookup routine will end:
                if (inode)
                        return d_splice_alias(inode, dentry);
                d_add(dentry, inode);
                return NULL;
        }



A filesystem implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block.  This field must point to a "struct export_operations"
struct which could potentially be full of NULLs, though normally at
least get_parent will be set.

The primary operations are decode_fh and encode_fh.
decode_fh takes a filehandle fragment and tries to find or create a
dentry for the object referred to by the filehandle.
encode_fh takes a dentry and creates a filehandle fragment which can
later be used to find/create a dentry for the same object.

decode_fh will probably make use of "find_exported_dentry".
This function lives in the "exportfs" module which a filesystem does
not need unless it is being exported.  So rather than calling
find_exported_dentry directly, each filesystem should call it through
the find_exported_dentry pointer in its export_operations table.
This field is set correctly by the exporting agent (e.g. nfsd) when a
filesystem is exported, and before any export operations are called.

find_exported_dentry needs three support functions from the
filesystem:
  get_name.  When given a parent dentry and a child dentry, this
    should find a name in the directory identified by the parent
    dentry, which leads to the object identified by the child dentry.
    If no get_name function is supplied, a default implementation is
    provided which uses vfs_readdir to find potential names, and
    matches inode numbers to find the correct match.

  get_parent.  When given a dentry for a directory, this should return
    a dentry for the parent.  Quite possibly the parent dentry will
    have been allocated by d_alloc_anon.
    The default get_parent function just returns an error so any
    filehandle lookup that requires finding a parent will fail.
    ->lookup("..") is *not* used as a default as it can leave ".."
    entries in the dcache which are too messy to work with.

  get_dentry.  When given an opaque datum, this should find the
    implied object and create a dentry for it (possibly with
    d_alloc_anon).
    The opaque datum is whatever is passed down by the decode_fh
    function, and is often simply a fragment of the filehandle
    fragment.
    decode_fh passes two datums through find_exported_dentry: one that
    should be used to identify the target object, and one that can be
    used to identify the object's parent, should that be necessary.
    The default get_dentry function assumes that the datum contains an
    inode number and a generation number, and it attempts to get the
    inode using "iget" and check its validity by matching the
    generation number.  A filesystem should only depend on the default
    if iget can safely be used this way.
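
The default get_name behaviour described above (scan the parent
directory and match inode numbers) can be pictured with a small
userspace model.  This is only a sketch: struct model_dirent and
model_get_name() are hypothetical names invented for illustration, and
a plain array stands in for what vfs_readdir would report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one directory entry seen by a readdir scan. */
struct model_dirent {
    const char   *name;
    unsigned long ino;
};

/*
 * Walk the entries of the parent directory and copy out the first name
 * whose inode number matches the child.  Returns 0 on success and -1 if
 * no entry matches or the caller's buffer is too small.
 */
int model_get_name(const struct model_dirent *entries, size_t count,
                   unsigned long child_ino, char *name, size_t name_len)
{
    size_t i;

    assert(entries != NULL && name != NULL);

    for (i = 0; i < count; i++) {
        if (entries[i].ino != child_ino)
            continue;               /* not the object we are naming */
        if (strlen(entries[i].name) + 1 > name_len)
            return -1;              /* name does not fit */
        strcpy(name, entries[i].name);
        return 0;
    }
    return -1;                      /* child not found in this parent */
}
```

A filesystem that can map a child to its name more cheaply than a full
directory scan would presumably supply its own get_name instead.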

If decode_fh and/or encode_fh are left as NULL, then default
implementations are used.  These defaults are suitable for ext2 and
extremely similar filesystems (like ext3).

The default encode_fh creates a filehandle fragment from the inode
number and generation number of the target together with the inode
number and generation number of the parent (if the parent is
required).

The default decode_fh extracts the target and parent datums from the
filehandle assuming the format used by the default encode_fh and
passes them to find_exported_dentry.


A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one byte "type".
The decode_fh routine should not depend on the stated size that is
passed to it.  This size may be larger than the original filehandle
generated by encode_fh, in which case it will have been padded with
nuls.  Rather, the encode_fh routine should choose a "type" which
indicates to decode_fh how much of the filehandle is valid, and how
it should be interpreted.
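
The fragment layout and the role of the "type" byte can be sketched in
userspace C.  This is a model under stated assumptions, not the exportfs
code: the type values, struct model_inode and all model_* functions are
hypothetical.  It shows a decoder that trusts the type rather than the
stated length, and a generation check of the kind the default
get_dentry performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes; a real filesystem chooses its own values. */
enum { FH_TYPE_INO_GEN = 1, FH_TYPE_INO_GEN_PARENT = 2, FH_TYPE_ERROR = 255 };

struct model_inode {
    uint32_t ino;
    uint32_t generation;
};

/*
 * Encode the target (and, if given, its parent) as 4-byte words.
 * *len is the capacity in words on entry and the words used on return.
 */
int model_encode_fh(const struct model_inode *target,
                    const struct model_inode *parent,
                    uint32_t *fh, int *len)
{
    int need = parent ? 4 : 2;

    assert(target != NULL && fh != NULL && len != NULL);
    if (*len < need)
        return FH_TYPE_ERROR;       /* fragment buffer too small */
    fh[0] = target->ino;
    fh[1] = target->generation;
    if (parent) {
        fh[2] = parent->ino;
        fh[3] = parent->generation;
    }
    *len = need;
    return parent ? FH_TYPE_INO_GEN_PARENT : FH_TYPE_INO_GEN;
}

/*
 * Decode a fragment.  The type decides how many words are meaningful;
 * len may be larger because of nul padding and is only a lower bound.
 */
int model_decode_fh(const uint32_t *fh, int len, int type,
                    struct model_inode *target,
                    struct model_inode *parent, int *has_parent)
{
    int need;

    if (type == FH_TYPE_INO_GEN)
        need = 2;
    else if (type == FH_TYPE_INO_GEN_PARENT)
        need = 4;
    else
        return -1;                  /* unknown type */
    if (len < need)
        return -1;                  /* truncated fragment */

    target->ino = fh[0];
    target->generation = fh[1];
    *has_parent = (need == 4);
    if (*has_parent) {
        parent->ino = fh[2];
        parent->generation = fh[3];
    }
    return 0;
}

/* A handle is stale once the inode number has been reused. */
int model_handle_is_current(const struct model_inode *decoded,
                            const struct model_inode *live)
{
    return decoded->ino == live->ino &&
           decoded->generation == live->generation;
}
```

The generation comparison is what lets a server reject a handle whose
inode number now belongs to a different file.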
515
Documentation/filesystems/Locking
Normal file
@@ -0,0 +1,515 @@
        The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant
instances in the tree, don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases at the end of this file.
Don't turn it into a log - maintainers of out-of-the-tree code are supposed to
be able to use diff(1).
        Things currently missing here: socket operations. Alexey?

--------------------------- dentry_operations --------------------------
prototypes:
        int (*d_revalidate)(struct dentry *, int);
        int (*d_hash) (struct dentry *, struct qstr *);
        int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
        int (*d_delete)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);

locking rules:
        none have BKL
                dcache_lock     rename_lock     ->d_lock        may block
d_revalidate:   no              no              no              yes
d_hash:         no              no              no              yes
d_compare:      no              yes             no              no
d_delete:       yes             no              yes             no
d_release:      no              no              no              yes
d_iput:         no              no              no              yes

--------------------------- inode_operations ---------------------------
prototypes:
        int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *,struct dentry *,
                        struct nameidata *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *,int);
        int (*follow_link) (struct dentry *, struct nameidata *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, struct nameidata *);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);

locking rules:
        all may block, none have BKL
                i_sem(inode)
lookup:         yes
create:         yes
link:           yes (both)
mknod:          yes
symlink:        yes
mkdir:          yes
unlink:         yes (both)
rmdir:          yes (both)      (see below)
rename:         yes (all)       (see below)
readlink:       no
follow_link:    no
truncate:       yes             (see below)
setattr:        yes
permission:     no
getattr:        no
setxattr:       yes
getxattr:       no
listxattr:      no
removexattr:    yes
        Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
victim.
        cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
        ->truncate() is never called directly - it's a callback, not a
method. It's called by vmtruncate() - a library function normally used by
->setattr(). Locking information above applies to that call (i.e. is
inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE has been
passed).

See Documentation/filesystems/directory-locking for more detailed discussion
of the locking scheme for directory operations.
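
The relationship between ->setattr(), vmtruncate() and the ->truncate()
callback noted above can be sketched as a userspace model.  Everything
here is hypothetical (the model_* names, the MODEL_ATTR_SIZE value, the
lock_held flag standing in for i_sem); the point is only that the
callback runs under whatever lock the ->setattr() caller already holds.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ATTR_SIZE 0x1     /* hypothetical flag value for this sketch */

struct model_inode;

struct model_inode_ops {
    void (*truncate)(struct model_inode *);
};

struct model_inode {
    long size;
    int  lock_held;             /* models i_sem being held by the caller */
    int  truncate_calls;
    const struct model_inode_ops *ops;
};

struct model_iattr {
    unsigned int valid;
    long         size;
};

/* Library helper: update the size, then invoke the fs callback. */
int model_vmtruncate(struct model_inode *inode, long size)
{
    assert(inode->lock_held);   /* lock is inherited, not taken here */
    inode->size = size;
    if (inode->ops && inode->ops->truncate)
        inode->ops->truncate(inode);
    return 0;
}

/* The method proper: only a size change reaches the truncate callback. */
int model_setattr(struct model_inode *inode, const struct model_iattr *attr)
{
    assert(inode->lock_held);
    if (attr->valid & MODEL_ATTR_SIZE)
        return model_vmtruncate(inode, attr->size);
    return 0;
}

/* Example callback a filesystem might install. */
void model_fs_truncate(struct model_inode *inode)
{
    assert(inode->lock_held);
    inode->truncate_calls++;
}
```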

--------------------------- super_operations ---------------------------
prototypes:
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);
        void (*read_inode) (struct inode *);
        void (*dirty_inode) (struct inode *);
        int (*write_inode) (struct inode *, int);
        void (*put_inode) (struct inode *);
        void (*drop_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        int (*sync_fs)(struct super_block *sb, int wait);
        void (*write_super_lockfs) (struct super_block *);
        void (*unlockfs) (struct super_block *);
        int (*statfs) (struct super_block *, struct kstatfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct super_block *);
        int (*show_options)(struct seq_file *, struct vfsmount *);
        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);

locking rules:
        All may block.
                        BKL     s_lock  s_umount
alloc_inode:            no      no      no
destroy_inode:          no
read_inode:             no                              (see below)
dirty_inode:            no                              (must not sleep)
write_inode:            no
put_inode:              no
drop_inode:             no                              !!!inode_lock!!!
delete_inode:           no
put_super:              yes     yes     no
write_super:            no      yes     read
sync_fs:                no      no      read
write_super_lockfs:     ?
unlockfs:               ?
statfs:                 no      no      no
remount_fs:             no      yes     maybe           (see below)
clear_inode:            no
umount_begin:           yes     no      no
show_options:           no                              (vfsmount->sem)
quota_read:             no      no      no              (see below)
quota_write:            no      no      no              (see below)

->read_inode() is not a method - it's a callback used in iget().
->remount_fs() will have the s_umount lock if it's already mounted.
When called from get_sb_single, it does NOT have the s_umount lock.
->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also the dquot_operations section.

--------------------------- file_system_type ---------------------------
prototypes:
        struct super_block *(*get_sb) (struct file_system_type *, int,
                        const char *, void *);
        void (*kill_sb) (struct super_block *);
locking rules:
                may block       BKL
get_sb:         yes             yes
kill_sb:        yes             yes

->get_sb() returns error or a locked superblock (exclusive on ->s_umount).
->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.

--------------------------- address_space_operations --------------------------
prototypes:
        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*readpage)(struct file *, struct page *);
        int (*sync_page)(struct page *);
        int (*writepages)(struct address_space *, struct writeback_control *);
        int (*set_page_dirty)(struct page *page);
        int (*readpages)(struct file *filp, struct address_space *mapping,
                        struct list_head *pages, unsigned nr_pages);
        int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
        int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
        sector_t (*bmap)(struct address_space *, sector_t);
        int (*invalidatepage) (struct page *, unsigned long);
        int (*releasepage) (struct page *, int);
        int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
                        loff_t offset, unsigned long nr_segs);

locking rules:
        All except set_page_dirty may block

                        BKL     PageLocked(page)
writepage:              no      yes, unlocks (see below)
readpage:               no      yes, unlocks
sync_page:              no      maybe
writepages:             no
set_page_dirty:         no      no
readpages:              no
prepare_write:          no      yes
commit_write:           no      yes
bmap:                   yes
invalidatepage:         no      yes
releasepage:            no      yes
direct_IO:              no

        ->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
may be called from the request handler (/dev/loop).

        ->readpage() unlocks the page, either synchronously or via I/O
completion.

        ->readpages() populates the pagecache with the passed pages and starts
I/O against them.  They come unlocked upon I/O completion.

        ->writepage() is used for two purposes: for "memory cleansing" and for
"sync".  These are quite different operations and the behaviour may differ
depending upon the mode.

If writepage is called for sync (wbc->sync_mode != WB_SYNC_NONE) then
it *must* start I/O against the page, even if that would involve
blocking on in-progress I/O.

If writepage is called for memory cleansing (sync_mode ==
WB_SYNC_NONE) then its role is to get as much writeout underway as
possible.  So writepage should try to avoid blocking against
currently-in-progress I/O.

If the filesystem is not called for "sync" and it determines that it
would need to block against in-progress I/O to be able to start new I/O
against the page, the filesystem should redirty the page with
redirty_page_for_writepage(), then unlock the page and return zero.
This may also be done to avoid internal deadlocks, but rarely.

If the filesystem is called for sync then it must wait on any
in-progress I/O and then start new I/O.

The filesystem should unlock the page synchronously, before returning
to the caller.

Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
followed by unlocking it.  Once set_page_writeback() has been run against the
page, write I/O can be submitted and the write I/O completion handler must run
end_page_writeback() once the I/O is complete.  If no I/O is submitted, the
filesystem must run end_page_writeback() against the page before returning from
writepage.

That is: after 2.5.12, pages which are under writeout are *not* locked.  Note,
if the filesystem needs the page to be locked during writeout, that is ok, too:
the page is allowed to be unlocked at any point in time between the calls to
set_page_writeback() and end_page_writeback().

Note, failure to run either redirty_page_for_writepage() or the combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty in the
radix tree.  This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data.
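
The ->writepage() rules above reduce to a small decision procedure, which
the following userspace model tries to capture.  It is a sketch under
assumptions: struct model_page, its flags and every model_* function are
hypothetical stand-ins for the kernel helpers named in the text, and
waiting for in-progress I/O is modelled as instantaneous.

```c
#include <assert.h>
#include <stdbool.h>

enum model_sync_mode { MODEL_SYNC_NONE = 0, MODEL_SYNC_ALL = 1 };

struct model_page {
    bool locked;
    bool dirty;
    bool writeback;
    bool io_in_progress;    /* older I/O the fs would have to wait for */
    int  ios_submitted;
};

void model_redirty_page_for_writepage(struct model_page *p) { p->dirty = true; }
void model_set_page_writeback(struct model_page *p) { p->writeback = true; }
void model_end_page_writeback(struct model_page *p) { p->writeback = false; }
void model_unlock_page(struct model_page *p) { p->locked = false; }

/* Entered with the page locked and already marked clean by the caller. */
int model_writepage(struct model_page *page, enum model_sync_mode mode)
{
    assert(page->locked);

    if (page->io_in_progress) {
        if (mode == MODEL_SYNC_NONE) {
            /* Memory cleansing: do not block, hand the page back dirty. */
            model_redirty_page_for_writepage(page);
            model_unlock_page(page);
            return 0;
        }
        /* Sync: must wait for the old I/O, then start new I/O. */
        page->io_in_progress = false;
    }

    model_set_page_writeback(page);
    model_unlock_page(page);    /* unlocked before returning to the caller */
    page->ios_submitted++;      /* completion must end writeback later */
    return 0;
}

/* What the write I/O completion handler is required to do. */
void model_io_complete(struct model_page *page)
{
    model_end_page_writeback(page);
}
```

Either exit path leaves the page unlocked, and exactly one of "redirtied"
or "under writeback" holds on return, which is the invariant the text
warns about breaking.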

        ->sync_page() locking rules are not well-defined - usually it is called
with lock on page, but that is not guaranteed. Considering the currently
existing instances of this method ->sync_page() itself doesn't look
well-defined...

        ->writepages() is used for periodic writeback and for syscall-initiated
sync operations.  The address_space should start I/O against at least
*nr_to_write pages.  *nr_to_write must be decremented for each page which is
written.  The address_space implementation may write more (or fewer) pages
than *nr_to_write asks for, but it should try to be reasonably close.  If
nr_to_write is NULL, all dirty pages must be written.

writepages should _only_ write pages which are present on
mapping->io_pages.
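
The nr_to_write accounting described for ->writepages() can be modelled in
a few lines of userspace C.  This is a sketch, not kernel code: struct
model_mapping, its array of dirty flags (standing in for the io_pages
list) and model_writepages() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mapping {
    bool  *io_pages_dirty;  /* one flag per page queued for writeback */
    size_t nr_pages;
};

/*
 * Start "I/O" on dirty pages until the budget runs out.  A NULL budget
 * means every dirty page must be written.  Returns the pages written.
 */
size_t model_writepages(struct model_mapping *mapping, long *nr_to_write)
{
    size_t i, written = 0;

    assert(mapping != NULL);

    for (i = 0; i < mapping->nr_pages; i++) {
        if (nr_to_write && *nr_to_write <= 0)
            break;                          /* budget exhausted */
        if (!mapping->io_pages_dirty[i])
            continue;                       /* only queued dirty pages */
        mapping->io_pages_dirty[i] = false; /* models starting the write */
        written++;
        if (nr_to_write)
            (*nr_to_write)--;               /* one decrement per page */
    }
    return written;
}
```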
|
||||
->set_page_dirty() is called from various places in the kernel
|
||||
when the target page is marked as needing writeback. It may be called
|
||||
under spinlock (it cannot block) and is sometimes called with the page
|
||||
not locked.
|
||||
|
||||
->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
|
||||
filesystems and by the swapper. The latter will eventually go away. All
|
||||
instances do not actually need the BKL. Please, keep it that way and don't
|
||||
breed new callers.
|
||||
|
||||
->invalidatepage() is called when the filesystem must attempt to drop
|
||||
some or all of the buffers from the page when it is being truncated. It
|
||||
returns zero on success. If ->invalidatepage is zero, the kernel uses
|
||||
block_invalidatepage() instead.
|
||||
|
||||
->releasepage() is called when the kernel is about to try to drop the
|
||||
buffers from the page in preparation for freeing it. It returns zero to
|
||||
indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
|
||||
the kernel assumes that the fs has no private interest in the buffers.
|
||||
|
||||
Note: currently almost all instances of address_space methods are
|
||||
using BKL for internal serialization and that's one of the worst sources
|
||||
of contention. Normally they are calling library functions (in fs/buffer.c)
|
||||
and pass foo_get_block() as a callback (on local block-based filesystems,
|
||||
indeed). BKL is not needed for library stuff and is usually taken by
|
||||
foo_get_block(). It's an overkill, since block bitmaps can be protected by
|
||||
internal fs locking and real critical areas are much smaller than the areas
|
||||
filesystems protect now.
|
||||
|
||||
----------------------- file_lock_operations ------------------------------
|
||||
prototypes:
|
||||
void (*fl_insert)(struct file_lock *); /* lock insertion callback */
|
||||
void (*fl_remove)(struct file_lock *); /* lock removal callback */
|
||||
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
|
||||
void (*fl_release_private)(struct file_lock *);
|
||||
|
||||
|
||||
locking rules:
|
||||
BKL may block
|
||||
fl_insert: yes no
|
||||
fl_remove: yes no
|
||||
fl_copy_lock: yes no
|
||||
fl_release_private: yes yes
|
||||
|
||||
----------------------- lock_manager_operations ---------------------------
|
||||
prototypes:
|
||||
int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
|
||||
void (*fl_notify)(struct file_lock *); /* unblock callback */
|
||||
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
|
||||
void (*fl_release_private)(struct file_lock *);
|
||||
void (*fl_break)(struct file_lock *); /* break_lease callback */
|
||||
|
||||
locking rules:
|
||||
BKL may block
|
||||
fl_compare_owner: yes no
|
||||
fl_notify: yes no
|
||||
fl_copy_lock: yes no
|
||||
fl_release_private: yes yes
|
||||
fl_break: yes no
|
||||
|
||||
Currently only NFSD and NLM provide instances of this class. None of the
|
||||
them block. If you have out-of-tree instances - please, show up. Locking
|
||||
in that area will change.
|
||||
--------------------------- buffer_head -----------------------------------
|
||||
prototypes:
|
||||
void (*b_end_io)(struct buffer_head *bh, int uptodate);
|
||||
|
||||
locking rules:
|
||||
called from interrupts. In other words, extreme care is needed here.
|
||||
bh is locked, but that's all warranties we have here. Currently only RAID1,
|
||||
highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
|
||||
call this method upon the IO completion.
|
||||
|
||||
--------------------------- block_device_operations -----------------------
|
||||
prototypes:
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
|
||||
int (*media_changed) (struct gendisk *);
|
||||
int (*revalidate_disk) (struct gendisk *);
|
||||
|
||||
locking rules:
|
||||
BKL bd_sem
|
||||
open: yes yes
|
||||
release: yes yes
|
||||
ioctl: yes no
|
||||
media_changed: no no
|
||||
revalidate_disk: no no
|
||||
|
||||
The last two are called only from check_disk_change().
|
||||
|
||||
--------------------------- file_operations -------------------------------
|
||||
prototypes:
|
||||
loff_t (*llseek) (struct file *, loff_t, int);
|
||||
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
|
||||
ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
|
||||
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
|
||||
ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t,
|
||||
loff_t);
|
||||
int (*readdir) (struct file *, void *, filldir_t);
|
||||
unsigned int (*poll) (struct file *, struct poll_table_struct *);
|
||||
int (*ioctl) (struct inode *, struct file *, unsigned int,
|
||||
unsigned long);
|
||||
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
|
||||
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
|
||||
int (*mmap) (struct file *, struct vm_area_struct *);
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*flush) (struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*fsync) (struct file *, struct dentry *, int datasync);
|
||||
int (*aio_fsync) (struct kiocb *, int datasync);
|
||||
int (*fasync) (int, struct file *, int);
|
||||
int (*lock) (struct file *, int, struct file_lock *);
|
||||
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
|
||||
loff_t *);
|
||||
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
|
||||
loff_t *);
|
||||
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
|
||||
void __user *);
|
||||
ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
|
||||
loff_t *, int);
|
||||
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
|
||||
unsigned long, unsigned long, unsigned long);
|
||||
int (*check_flags)(int);
|
||||
int (*dir_notify)(struct file *, unsigned long);
|
||||
};
|
||||
|
||||
locking rules:
|
||||
All except ->poll() may block.
|
||||
BKL
|
||||
llseek: no (see below)
|
||||
read: no
|
||||
aio_read: no
|
||||
write: no
|
||||
aio_write: no
|
||||
readdir: no
|
||||
poll: no
|
||||
ioctl: yes (see below)
|
||||
unlocked_ioctl: no (see below)
|
||||
compat_ioctl: no
|
||||
mmap: no
|
||||
open: maybe (see below)
|
||||
flush: no
|
||||
release: no
|
||||
fsync: no (see below)
|
||||
aio_fsync: no
|
||||
fasync: yes (see below)
|
||||
lock: yes
|
||||
readv: no
|
||||
writev: no
|
||||
sendfile: no
|
||||
sendpage: no
|
||||
get_unmapped_area: no
|
||||
check_flags: no
|
||||
dir_notify: no
|
||||
|
||||
->llseek() locking has moved from llseek to the individual llseek
|
||||
implementations. If your fs is not using generic_file_llseek, you
|
||||
need to acquire and release the appropriate locks in your ->llseek().
|
||||
For many filesystems, it is probably safe to acquire the inode
|
||||
semaphore. Note some filesystems (i.e. remote ones) provide no
|
||||
protection for i_size so you will need to use the BKL.
|
||||
|
||||
->open() locking is in-transit: big lock partially moved into the methods.
|
||||
The only exception is ->open() in the instances of file_operations that never
|
||||
end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices
|
||||
(chrdev_open() takes lock before replacing ->f_op and calling the secondary
|
||||
method. As soon as we fix the handling of module reference counters all
|
||||
instances of ->open() will be called without the BKL.
|
||||
|
||||
Note: ext2_release() was *the* source of contention on fs-intensive
|
||||
loads and dropping BKL on ->release() helps to get rid of that (we still
|
||||
grab BKL for cases when we close a file that had been opened r/w, but that
|
||||
can and should be done using the internal locking with smaller critical areas).
|
||||
Current worst offender is ext2_get_block()...
|
||||
|
||||
->fasync() is a mess. This area needs a big cleanup and that will probably
|
||||
affect locking.
|
||||
|
||||
->readdir() and ->ioctl() on directories must be changed. Ideally we would
|
||||
move ->readdir() to inode_operations and use a separate method for directory
|
||||
->ioctl() or kill the latter completely. One of the problems is that for
|
||||
anything that resembles union-mount we won't have a struct file for all
|
||||
components. And there are other reasons why the current interface is a mess...
|
||||
|
||||
->ioctl() on regular files is superseded by the ->unlocked_ioctl() that
|
||||
doesn't take the BKL.
|
||||
|
||||
->read on directories probably must go away - we should just enforce -EISDIR
|
||||
in sys_read() and friends.
|
||||
|
||||
->fsync() has i_sem on inode.
|
||||
|
||||
--------------------------- dquot_operations -------------------------------
|
||||
prototypes:
|
||||
int (*initialize) (struct inode *, int);
|
||||
int (*drop) (struct inode *);
|
||||
int (*alloc_space) (struct inode *, qsize_t, int);
|
||||
int (*alloc_inode) (const struct inode *, unsigned long);
|
||||
int (*free_space) (struct inode *, qsize_t);
|
||||
int (*free_inode) (const struct inode *, unsigned long);
|
||||
int (*transfer) (struct inode *, struct iattr *);
|
||||
int (*write_dquot) (struct dquot *);
|
||||
int (*acquire_dquot) (struct dquot *);
|
||||
int (*release_dquot) (struct dquot *);
|
||||
int (*mark_dirty) (struct dquot *);
|
||||
int (*write_info) (struct super_block *, int);
|
||||
|
||||
These operations are intended to be more or less wrapping functions that ensure
|
||||
a proper locking wrt the filesystem and call the generic quota operations.
|
||||
|
||||
What a filesystem should expect from the generic quota functions:
|
||||
|
||||
                FS recursion    Held locks when called
initialize:     yes             maybe dqonoff_sem
drop:           yes             -
alloc_space:    ->mark_dirty()  -
alloc_inode:    ->mark_dirty()  -
free_space:     ->mark_dirty()  -
free_inode:     ->mark_dirty()  -
transfer:       yes             -
write_dquot:    yes             dqonoff_sem or dqptr_sem
acquire_dquot:  yes             dqonoff_sem or dqptr_sem
release_dquot:  yes             dqonoff_sem or dqptr_sem
mark_dirty:     no              -
write_info:     yes             dqonoff_sem
|
||||
|
||||
FS recursion means calling ->quota_read() and ->quota_write() from superblock
|
||||
operations.
|
||||
|
||||
->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
|
||||
only directly by the filesystem and do not call any fs functions, only
|
||||
the ->mark_dirty() operation.
|
||||
|
||||
More details about quota locking can be found in fs/dquot.c.
|
||||
|
||||
--------------------------- vm_operations_struct -----------------------------
|
||||
prototypes:
|
||||
void (*open)(struct vm_area_struct*);
|
||||
void (*close)(struct vm_area_struct*);
|
||||
struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
|
||||
|
||||
locking rules:
|
||||
                BKL     mmap_sem
open:           no      yes
close:          no      yes
nopage:         no      yes
|
||||
|
||||
================================================================================
|
||||
Dubious stuff
|
||||
|
||||
(if you break something or notice that it is broken and do not fix it yourself
|
||||
- at least put it here)
|
||||
|
||||
ipc/shm.c::shm_delete() - may need BKL.
|
||||
->read() and ->write() in many drivers are (probably) missing BKL.
|
||||
drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.
|
57
Documentation/filesystems/adfs.txt
Normal file
@@ -0,0 +1,57 @@
|
||||
Mount options for ADFS
|
||||
----------------------
|
||||
|
||||
uid=nnn All files in the partition will be owned by
|
||||
user id nnn. Default 0 (root).
|
||||
gid=nnn All files in the partition will be in group
|
||||
nnn. Default 0 (root).
|
||||
ownmask=nnn The permission mask for ADFS 'owner' permissions
|
||||
will be nnn. Default 0700.
|
||||
othmask=nnn The permission mask for ADFS 'other' permissions
|
||||
will be nnn. Default 0077.
|
||||
|
||||
Mapping of ADFS permissions to Linux permissions
|
||||
------------------------------------------------
|
||||
|
||||
ADFS permissions consist of the following:
|
||||
|
||||
Owner read
|
||||
Owner write
|
||||
Other read
|
||||
Other write
|
||||
|
||||
(In older versions, an 'execute' permission did exist, but this
|
||||
does not hold the same meaning as the Linux 'execute' permission
|
||||
and is now obsolete).
|
||||
|
||||
The mapping is performed as follows:
|
||||
|
||||
Owner read -> -r--r--r--
|
||||
Owner write -> --w--w--w-
|
||||
Owner read and filetype UnixExec -> ---x--x--x
|
||||
These are then masked by ownmask, eg 700 -> -rwx------
|
||||
Possible owner mode permissions -> -rwx------
|
||||
|
||||
Other read -> -r--r--r--
|
||||
Other write -> --w--w--w-
|
||||
Other read and filetype UnixExec -> ---x--x--x
|
||||
These are then masked by othmask, eg 077 -> ----rwxrwx
|
||||
Possible other mode permissions -> ----rwxrwx
|
||||
|
||||
Hence, with the default masks, if a file is owner read/write, and
|
||||
not a UnixExec filetype, then the permissions will be:
|
||||
|
||||
-rw-------
|
||||
|
||||
However, if the masks were ownmask=0770,othmask=0007, then this would
|
||||
be modified to:
|
||||
-rw-rw----
|
||||
|
||||
There is no restriction on what you can do with these masks. You may
|
||||
wish that either read bits give read access to the file for all, but
|
||||
keep the default write protection (ownmask=0755,othmask=0577):
|
||||
|
||||
-rw-r--r--
|
||||
|
||||
You can therefore tailor the permission translation to whatever you
|
||||
desire the permissions should be under Linux.
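
The mapping above can be sketched in C as follows. This is only an
illustration of the rules described in this file, not the driver's own
code: the ATTR_* bit values and the function name are made up for the
example.

```c
/* Hypothetical attribute bits, named for this sketch only. */
#define ATTR_OWNER_READ   0x01
#define ATTR_OWNER_WRITE  0x02
#define ATTR_OTHER_READ   0x04
#define ATTR_OTHER_WRITE  0x08

/*
 * Translate ADFS attributes into Linux permission bits. Each ADFS bit
 * fans out to user, group and other; the owner set is then cut down by
 * ownmask and the other set by othmask, and the two are combined.
 */
unsigned int adfs_perm_sketch(unsigned int attr, int unix_exec,
			      unsigned int ownmask, unsigned int othmask)
{
	unsigned int owner = 0, other = 0;

	if (attr & ATTR_OWNER_READ)
		owner |= 0444;
	if (attr & ATTR_OWNER_WRITE)
		owner |= 0222;
	if ((attr & ATTR_OWNER_READ) && unix_exec)
		owner |= 0111;

	if (attr & ATTR_OTHER_READ)
		other |= 0444;
	if (attr & ATTR_OTHER_WRITE)
		other |= 0222;
	if ((attr & ATTR_OTHER_READ) && unix_exec)
		other |= 0111;

	return (owner & ownmask) | (other & othmask);
}
```

For an owner read/write file that is not UnixExec, calling this with
the default masks (0700, 0077) gives 0600, and with (0770, 0007) it
gives 0660, matching the two examples above.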
|
219
Documentation/filesystems/affs.txt
Normal file
@@ -0,0 +1,219 @@
|
||||
Overview of Amiga Filesystems
|
||||
=============================
|
||||
|
||||
Not all varieties of the Amiga filesystems are supported for reading and
|
||||
writing. The Amiga currently knows six different filesystems:
|
||||
|
||||
DOS\0 The old or original filesystem, not really suited for
|
||||
hard disks and normally not used on them, either.
|
||||
Supported read/write.
|
||||
|
||||
DOS\1 The original Fast File System. Supported read/write.
|
||||
|
||||
DOS\2 The old "international" filesystem. International means that
|
||||
a bug has been fixed so that accented ("international") letters
|
||||
in file names are case-insensitive, as they ought to be.
|
||||
Supported read/write.
|
||||
|
||||
DOS\3 The "international" Fast File System. Supported read/write.
|
||||
|
||||
DOS\4 The original filesystem with directory cache. The directory
|
||||
cache speeds up directory accesses on floppies considerably,
|
||||
but slows down file creation/deletion. Doesn't make much
|
||||
sense on hard disks. Supported read only.
|
||||
|
||||
DOS\5 The Fast File System with directory cache. Supported read only.
|
||||
|
||||
All of the above filesystems allow block sizes from 512 to 32K bytes.
|
||||
Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
|
||||
speed up almost everything at the expense of wasted disk space. The speed
|
||||
gain above 4K seems not really worth the price, so you don't lose too
|
||||
much here, either.
|
||||
|
||||
The muFS (multi user File System) equivalents of the above file systems
|
||||
are supported, too.
|
||||
|
||||
Mount options for the AFFS
|
||||
==========================
|
||||
|
||||
protect If this option is set, the protection bits cannot be altered.
|
||||
|
||||
setuid[=uid] This sets the owner of all files and directories in the file
|
||||
system to uid or the uid of the current user, respectively.
|
||||
|
||||
setgid[=gid] Same as above, but for gid.
|
||||
|
||||
mode=mode Sets the mode flags to the given (octal) value, regardless
|
||||
of the original permissions. Directories will get an x
|
||||
permission if the corresponding r bit is set.
|
||||
This is useful since most of the plain AmigaOS files
|
||||
will map to 600.
|
||||
|
||||
reserved=num Sets the number of reserved blocks at the start of the
|
||||
partition to num. You should never need this option.
|
||||
Default is 2.
|
||||
|
||||
root=block Sets the block number of the root block. This should never
|
||||
be necessary.
|
||||
|
||||
bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
|
||||
1024, 2048 and 4096. Like the root option, this should
|
||||
never be necessary, as the affs can figure it out itself.
|
||||
|
||||
quiet The file system will not return an error for disallowed
|
||||
mode changes.
|
||||
|
||||
verbose The volume name, file system type and block size will
|
||||
be written to the syslog when the filesystem is mounted.
|
||||
|
||||
mufs The filesystem is really a muFS, although it doesn't
|
||||
identify itself as one. This option is necessary if
|
||||
the filesystem wasn't formatted as muFS, but is used
|
||||
as one.
|
||||
|
||||
prefix=path Path will be prefixed to every absolute path name of
|
||||
symbolic links on an AFFS partition. Default = "/".
|
||||
(See below.)
|
||||
|
||||
volume=name When symbolic links with an absolute path are created
|
||||
on an AFFS partition, name will be prepended as the
|
||||
volume name. Default = "" (empty string).
|
||||
(See below.)
|
||||
|
||||
Handling of the Users/Groups and protection flags
|
||||
=================================================
|
||||
|
||||
Amiga -> Linux:
|
||||
|
||||
The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
|
||||
|
||||
- R maps to r for user, group and others. On directories, R implies x.
|
||||
|
||||
- If both W and D are allowed, w will be set.
|
||||
|
||||
- E maps to x.
|
||||
|
||||
- H and P are always retained and ignored under Linux.
|
||||
|
||||
- A is always reset when a file is written to.
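
The Amiga -> Linux rules above can be sketched in C. This is a
simplified illustration, not the driver's code: it assumes the flags
have already been decoded into one R/W/E/D group per class, and the
struct and function names are hypothetical. H, S, P and A are left out
because they do not affect the Linux mode.

```c
/*
 * One R/W/E/D group from the Amiga protection flags. A field is 1 when
 * the operation is allowed. The layout is an assumption of this sketch.
 */
struct amiga_class {
	int r, w, e, d;
};

/* rwx bits (0..7) for a single class, following the list above. */
unsigned int affs_class_bits(struct amiga_class c, int is_dir)
{
	unsigned int bits = 0;

	if (c.r)
		bits |= 4;
	if (c.r && is_dir)
		bits |= 1;	/* on directories, R implies x */
	if (c.w && c.d)
		bits |= 2;	/* w is set only if both W and D are allowed */
	if (c.e)
		bits |= 1;
	return bits;
}

/* Combine the three classes into a Linux permission value. */
unsigned int affs_mode_sketch(struct amiga_class user,
			      struct amiga_class group,
			      struct amiga_class other, int is_dir)
{
	return (affs_class_bits(user, is_dir) << 6) |
	       (affs_class_bits(group, is_dir) << 3) |
	       affs_class_bits(other, is_dir);
}
```

A plain file with only R, W and D allowed for the user comes out as
0600 under these rules.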
|
||||
|
||||
User id and group id will be used unless set[gu]id are given as mount
|
||||
options. Since most of the Amiga file systems are single user systems
|
||||
they will be owned by root. The root directory (the mount point) of the
|
||||
Amiga filesystem will be owned by the user who actually mounts the
|
||||
filesystem (the root directory doesn't have uid/gid fields).
|
||||
|
||||
Linux -> Amiga:
|
||||
|
||||
The Linux rwxrwxrwx file mode is handled as follows:
|
||||
|
||||
- r permission will set R for user, group and others.
|
||||
|
||||
- w permission will set W and D for user, group and others.
|
||||
|
||||
- x permission of the user will set E for plain files.
|
||||
|
||||
- All other flags (suid, sgid, ...) are ignored and will
|
||||
not be retained.
|
||||
|
||||
Newly created files and directories will get the user and group ID
|
||||
of the current user and a mode according to the umask.
|
||||
|
||||
Symbolic links
|
||||
==============
|
||||
|
||||
Although the Amiga and Linux file systems resemble each other, there
|
||||
are some, not always subtle, differences. One of them becomes apparent
|
||||
with symbolic links. While Linux has a file system with exactly one
|
||||
root directory, the Amiga has a separate root directory for each
|
||||
file system (for example, partition, floppy disk, ...). With the Amiga,
|
||||
these entities are called "volumes". They have symbolic names which
|
||||
can be used to access them. Thus, symbolic links can point to a
|
||||
different volume. AFFS turns the volume name into a directory name
|
||||
and prepends the prefix path (see prefix option) to it.
|
||||
|
||||
Example:
|
||||
You mount all your Amiga partitions under /amiga/<volume> (where
|
||||
<volume> is the name of the volume), and you give the option
|
||||
"prefix=/amiga/" when mounting all your AFFS partitions. (They
|
||||
might be "User", "WB" and "Graphics", the mount points /amiga/User,
|
||||
/amiga/WB and /amiga/Graphics). A symbolic link referring to
|
||||
"User:sc/include/dos/dos.h" will be followed to
|
||||
"/amiga/User/sc/include/dos/dos.h".
|
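
The translation in the example can be sketched in C as below. The
function name is hypothetical, and copying a target without a volume
name unchanged is an assumption of the sketch; the real driver may
treat such links differently.

```c
#include <stdio.h>
#include <string.h>

/*
 * Turn an Amiga link target such as "User:sc/include/dos/dos.h" into a
 * Linux path by prepending the prefix and using the volume name as a
 * directory. Returns 0 on success, -1 if the result does not fit in
 * the output buffer.
 */
int affs_link_sketch(const char *prefix, const char *target,
		     char *out, size_t outlen)
{
	const char *colon = strchr(target, ':');
	int n;

	if (!colon)
		/* No volume name: copy as-is (assumption). */
		n = snprintf(out, outlen, "%s", target);
	else
		/* prefix + volume + "/" + rest of the path */
		n = snprintf(out, outlen, "%s%.*s/%s", prefix,
			     (int)(colon - target), target, colon + 1);

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With prefix "/amiga/" and target "User:sc/include/dos/dos.h" the
output buffer holds "/amiga/User/sc/include/dos/dos.h".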
||||
|
||||
Examples
|
||||
========
|
||||
|
||||
Command line:
|
||||
mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
|
||||
mount /dev/sda3 /Amiga -t affs
|
||||
|
||||
/etc/fstab entry:
|
||||
/dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
|
||||
|
||||
IMPORTANT NOTE
|
||||
==============
|
||||
|
||||
If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
|
||||
have an Amiga harddisk connected to your PC, it will overwrite
|
||||
the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
|
||||
the Rigid Disk Block. Sheer luck has it that this is an unused
|
||||
area of the RDB, so only the checksum doesn't match anymore.
|
||||
Linux will ignore this garbage and recognize the RDB anyway, but
|
||||
before you connect that drive to your Amiga again, you must
|
||||
restore or repair your RDB. So please do make a backup copy of it
|
||||
before booting Windows!
|
||||
|
||||
If the damage is already done, the following should fix the RDB
|
||||
(where <disk> is the device name).
|
||||
DO AT YOUR OWN RISK:
|
||||
|
||||
dd if=/dev/<disk> of=rdb.tmp count=1
|
||||
cp rdb.tmp rdb.fixed
|
||||
dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4
|
||||
dd if=rdb.fixed of=/dev/<disk>
|
||||
|
||||
Bugs, Restrictions, Caveats
|
||||
===========================
|
||||
|
||||
Quite a few things may not work as advertised. Not everything is
|
||||
tested, though several hundred MB have been read and written using
|
||||
this fs. For the most up-to-date list of bugs please consult
|
||||
fs/affs/Changes.
|
||||
|
||||
Filenames are truncated to 30 characters without warning (this
|
||||
can be changed by setting the compile-time option AFFS_NO_TRUNCATE
|
||||
in include/linux/amigaffs.h).
|
||||
|
||||
Case is ignored by the affs in filename matching, but Linux shells
|
||||
do care about the case. Example (with /wb being an affs mounted fs):
|
||||
rm /wb/WRONGCASE
|
||||
will remove /wb/wrongcase, but
|
||||
rm /wb/WR*
|
||||
will not since the names are matched by the shell.
|
||||
|
||||
The block allocation is designed for hard disk partitions. If more
|
||||
than 1 process writes to a (small) diskette, the blocks are allocated
|
||||
in an ugly way (but the real AFFS doesn't do much better). This
|
||||
is also true when space gets tight.
|
||||
|
||||
You cannot execute programs on an OFS (Old File System), since the
|
||||
program files cannot be memory mapped due to the 488 byte blocks.
|
||||
For the same reason you cannot mount an image on such a filesystem
|
||||
via the loopback device.
|
||||
|
||||
The bitmap valid flag in the root block may not be accurate when the
|
||||
system crashes while an affs partition is mounted. There's currently
|
||||
no way to fix a garbled filesystem without an Amiga (disk validator)
|
||||
or manually (who would do this?). Maybe later.
|
||||
|
||||
If you mount affs partitions on system startup, you may want to tell
|
||||
fsck that the fs should not be checked (place a '0' in the sixth field
|
||||
of /etc/fstab).
|
||||
|
||||
It's not possible to read floppy disks with a normal PC or workstation
|
||||
due to an incompatibility with the Amiga floppy controller.
|
||||
|
||||
If you are interested in an Amiga Emulator for Linux, look at
|
||||
|
||||
http://www-users.informatik.rwth-aachen.de/~crux/uae.html
|
155
Documentation/filesystems/afs.txt
Normal file
@@ -0,0 +1,155 @@
|
||||
kAFS: AFS FILESYSTEM
|
||||
====================
|
||||
|
||||
ABOUT
|
||||
=====
|
||||
|
||||
This filesystem provides a fairly simple AFS filesystem driver. It is under
|
||||
development and only provides very basic facilities. It does not yet support
|
||||
the following AFS features:
|
||||
|
||||
(*) Write support.
|
||||
(*) Communications security.
|
||||
(*) Local caching.
|
||||
(*) pioctl() system call.
|
||||
(*) Automatic mounting of embedded mountpoints.
|
||||
|
||||
|
||||
USAGE
|
||||
=====
|
||||
|
||||
When inserting the driver modules the root cell must be specified along with a
|
||||
list of volume location server IP addresses:
|
||||
|
||||
insmod rxrpc.o
|
||||
insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
|
||||
|
||||
The first module is a driver for the RxRPC remote operation protocol, and the
|
||||
second is the actual filesystem driver for the AFS filesystem.
|
||||
|
||||
Once the module has been loaded, more cells can be added by the following
|
||||
procedure:
|
||||
|
||||
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
|
||||
|
||||
Where the parameters to the "add" command are the name of a cell and a list of
|
||||
volume location servers within that cell.
|
||||
|
||||
Filesystems can be mounted anywhere by commands similar to the following:
|
||||
|
||||
mount -t afs "%cambridge.redhat.com:root.afs." /afs
|
||||
mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
|
||||
mount -t afs "#root.afs." /afs
|
||||
mount -t afs "#root.cell." /afs/cambridge
|
||||
|
||||
NB: When using this on Linux 2.4, the mount command has to be different,
|
||||
since the filesystem doesn't have access to the device name argument:
|
||||
|
||||
mount -t afs none /afs -ovol="#root.afs."
|
||||
|
||||
Where the initial character is either a hash or a percent symbol depending on
|
||||
whether you definitely want a R/W volume (percent) or whether you'd prefer a R/O
volume, but are willing to use a R/W volume instead (hash).
|
||||
|
||||
The name of the volume can be suffixed with ".backup" or ".readonly" to
|
||||
specify connection to only volumes of those types.
|
||||
|
||||
The name of the cell is optional, and if not given during a mount, then the
|
||||
named volume will be looked up in the cell specified during insmod.
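
The structure of the "device name" described above can be sketched in
C. This is an illustration only, not the parser kAFS uses: the struct,
the function name and the buffer sizes are hypothetical, and treating a
trailing dot as a terminator is an assumption drawn from the examples.
The leading character is stored as given; a ".backup" or ".readonly"
suffix is kept as part of the volume name.

```c
#include <string.h>

struct afs_volspec {
	char marker;		/* leading '#' or '%', as given */
	char cell[64];		/* empty if no cell was named */
	char volume[64];	/* volume name, any type suffix included */
};

/* Split "<marker>[cell:]volume[.]" into its parts; -1 if malformed. */
int afs_parse_volspec(const char *spec, struct afs_volspec *v)
{
	const char *name, *colon;
	size_t len;

	if (spec[0] != '#' && spec[0] != '%')
		return -1;
	v->marker = spec[0];
	name = spec + 1;

	/* The cell name is optional and ends at the first colon. */
	v->cell[0] = '\0';
	colon = strchr(name, ':');
	if (colon) {
		len = (size_t)(colon - name);
		if (len >= sizeof(v->cell))
			return -1;
		memcpy(v->cell, name, len);
		v->cell[len] = '\0';
		name = colon + 1;
	}

	len = strlen(name);
	if (len > 0 && name[len - 1] == '.')
		len--;		/* drop the terminating dot (assumption) */
	if (len == 0 || len >= sizeof(v->volume))
		return -1;
	memcpy(v->volume, name, len);
	v->volume[len] = '\0';
	return 0;
}
```

An empty cell field after parsing corresponds to the case where the
volume is looked up in the cell given at insmod time.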
|
||||
|
||||
Additional cells can be added through /proc (see later section).
|
||||
|
||||
|
||||
MOUNTPOINTS
|
||||
===========
|
||||
|
||||
AFS has a concept of mountpoints. These are specially formatted symbolic links
|
||||
(of the same form as the "device name" passed to mount). kAFS presents these
|
||||
to the user as directories that have special properties:
|
||||
|
||||
(*) They cannot be listed. Running a program like "ls" on them will incur an
|
||||
EREMOTE error (Object is remote).
|
||||
|
||||
(*) Other objects can't be looked up inside of them. This also incurs an
|
||||
EREMOTE error.
|
||||
|
||||
(*) They can be queried with the readlink() system call, which will return
|
||||
the name of the mountpoint to which they point. The "readlink" program
|
||||
will also work.
|
||||
|
||||
(*) They can be mounted on (which symbolic links can't).
|
||||
|
||||
|
||||
PROC FILESYSTEM
|
||||
===============
|
||||
|
||||
The rxrpc module creates a number of files in various places in the /proc
|
||||
filesystem:
|
||||
|
||||
(*) Firstly, some information files are made available in a directory called
|
||||
"/proc/net/rxrpc/". These list the extant transport endpoint, peer,
|
||||
connection and call records.
|
||||
|
||||
(*) Secondly, some control files are made available in a directory called
|
||||
"/proc/sys/rxrpc/". Currently, all these files can be used for is to
|
||||
turn on various levels of tracing.
|
||||
|
||||
The AFS module creates a "/proc/fs/afs/" directory and populates it:
|
||||
|
||||
(*) A "cells" file that lists cells currently known to the afs module.
|
||||
|
||||
(*) A directory per cell that contains files that list volume location
|
||||
servers, volumes, and active servers known within that cell.
|
||||
|
||||
|
||||
THE CELL DATABASE
|
||||
=================
|
||||
|
||||
The filesystem maintains an internal database of all the cells it knows and
|
||||
the IP addresses of the volume location servers for those cells. The cell to
|
||||
which the computer belongs is added to the database when insmod is performed
|
||||
by the "rootcell=" argument.
|
||||
|
||||
Further cells can be added by commands similar to the following:
|
||||
|
||||
echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
|
||||
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
|
||||
|
||||
No other cell database operations are available at this time.
|
||||
|
||||
|
||||
EXAMPLES
|
||||
========
|
||||
|
||||
Here's what I use to test this. Some of the names and IP addresses are local
|
||||
to my internal DNS. My "root.afs" partition has a mount point within it for
|
||||
some public volumes.
|
||||
|
||||
insmod -S /tmp/rxrpc.o
|
||||
insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
|
||||
|
||||
mount -t afs \%root.afs. /afs
|
||||
mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
|
||||
|
||||
echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells
|
||||
mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
|
||||
mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
|
||||
mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
|
||||
mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
|
||||
mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
|
||||
mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
|
||||
mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
|
||||
mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
|
||||
|
||||
umount /afs/grand.central.org/user
|
||||
umount /afs/grand.central.org/software
|
||||
umount /afs/grand.central.org/service
|
||||
umount /afs/grand.central.org/project
|
||||
umount /afs/grand.central.org/doc
|
||||
umount /afs/grand.central.org/contrib
|
||||
umount /afs/grand.central.org/archive
|
||||
umount /afs/grand.central.org
|
||||
umount /afs/cambridge.redhat.com
|
||||
umount /afs
|
||||
rmmod kafs
|
||||
rmmod rxrpc
|
118
Documentation/filesystems/automount-support.txt
Normal file
@@ -0,0 +1,118 @@
|
||||
Support is available for filesystems that wish to do automounting (such
|
||||
as kAFS which can be found in fs/afs/). This facility includes allowing
|
||||
in-kernel mounts to be performed and mountpoint degradation to be
|
||||
requested. The latter can also be requested by userspace.
|
||||
|
||||
|
||||
======================
|
||||
IN-KERNEL AUTOMOUNTING
|
||||
======================
|
||||
|
||||
A filesystem can now mount another filesystem on one of its directories by the
|
||||
following procedure:
|
||||
|
||||
(1) Give the directory a follow_link() operation.
|
||||
|
||||
When the directory is accessed, the follow_link op will be called, and
|
||||
it will be provided with the location of the mountpoint in the nameidata
|
||||
structure (vfsmount and dentry).
|
||||
|
||||
(2) Have the follow_link() op do the following steps:
|
||||
|
||||
(a) Call do_kern_mount() to call the appropriate filesystem to set up a
|
||||
superblock and gain a vfsmount structure representing it.
|
||||
|
||||
(b) Copy the nameidata provided as an argument and substitute the dentry
|
||||
argument into the copy.
|
||||
|
||||
(c) Call do_add_mount() to install the new vfsmount into the namespace's
|
||||
mountpoint tree, thus making it accessible to userspace. Use the
|
||||
nameidata set up in (b) as the destination.
|
||||
|
||||
If the mountpoint will be automatically expired, then do_add_mount()
|
||||
should also be given the location of an expiration list (see further
|
||||
down).
|
||||
|
||||
(d) Release the path in the nameidata argument and substitute in the new
|
||||
vfsmount and its root dentry. The ref counts on these will need
|
||||
incrementing.
|
||||
|
||||
Then from userspace, you can just do something like:
|
||||
|
||||
[root@andromeda root]# mount -t afs \#root.afs. /afs
|
||||
[root@andromeda root]# ls /afs
|
||||
asd cambridge cambridge.redhat.com grand.central.org
|
||||
[root@andromeda root]# ls /afs/cambridge
|
||||
afsdoc
|
||||
[root@andromeda root]# ls /afs/cambridge/afsdoc/
|
||||
ChangeLog html LICENSE pdf RELNOTES-1.2.2
|
||||
|
||||
And then if you look in the mountpoint catalogue, you'll see something like:
|
||||
|
||||
[root@andromeda root]# cat /proc/mounts
|
||||
...
|
||||
#root.afs. /afs afs rw 0 0
|
||||
#root.cell. /afs/cambridge.redhat.com afs rw 0 0
|
||||
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
|
||||
|
||||
|
||||
===========================
|
||||
AUTOMATIC MOUNTPOINT EXPIRY
|
||||
===========================
|
||||
|
||||
Automatic expiration of mountpoints is easy, provided you've mounted the
|
||||
mountpoint to be expired in the automounting procedure outlined above.
|
||||
|
||||
To do expiration, you need to follow these steps:
|
||||
|
||||
(3) Create at least one list off which the vfsmounts to be expired can be
|
||||
hung. Access to this list will be governed by the vfsmount_lock.
|
||||
|
||||
(4) In step (2c) above, the call to do_add_mount() should be provided with a
|
||||
pointer to this list. It will hang the vfsmount off of it if it succeeds.
|
||||
|
||||
(5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
|
||||
with a pointer to this list. This will process the list, marking every
|
||||
vfsmount thereon for potential expiry on the next call.
|
||||
|
||||
If a vfsmount was already flagged for expiry, and if its usage count is 1
|
||||
(it's only referenced by its parent vfsmount), then it will be deleted
|
||||
from the namespace and thrown away (effectively unmounted).
|
||||
|
||||
It may prove simplest to call this at regular intervals, using
|
||||
some sort of timed event to drive it.
|
||||
|
||||
The expiration flag is cleared by calls to mntput. This means that expiration
|
||||
will only happen on the second expiration request after the last time the
|
||||
mountpoint was accessed.
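
The two-pass behaviour can be illustrated with a small userspace model
in C. This is not kernel code: the struct and both functions are
hypothetical stand-ins for a vfsmount, mark_mounts_for_expiry() and the
flag clearing done on mntput.

```c
/* Minimal stand-in for a vfsmount on an expiration list. */
struct mnt_sketch {
	int refcount;	/* 1 means only the parent mount references it */
	int marked;	/* set by a previous expiry pass */
	int mounted;	/* cleared once the mount has been expired */
};

/*
 * One expiry pass over the list, following step (5): a mount that is
 * already marked and has a usage count of 1 is removed, anything else
 * is marked for the next pass. Returns the number of mounts removed.
 */
int expiry_pass(struct mnt_sketch *list, int n)
{
	int i, removed = 0;

	for (i = 0; i < n; i++) {
		struct mnt_sketch *m = &list[i];

		if (!m->mounted)
			continue;
		if (m->marked && m->refcount == 1) {
			m->mounted = 0;
			removed++;
		} else {
			m->marked = 1;
		}
	}
	return removed;
}

/* Any access ends with a reference being dropped, clearing the mark. */
void access_mount(struct mnt_sketch *m)
{
	m->marked = 0;
}
```

In this model an idle mount survives the first pass and is removed by
the second, while an access between two passes restarts the count.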
|
||||
|
||||
If a mountpoint is moved, it gets removed from the expiration list. If a bind
|
||||
mount is made on an expirable mount, the new vfsmount will not be on the
|
||||
expiration list and will not expire.
|
||||
|
||||
If a namespace is copied, all mountpoints contained therein will be copied,
|
||||
and the copies of those that are on an expiration list will be added to the
|
||||
same expiration list.
|
||||
|
||||
|
||||
=======================
|
||||
USERSPACE DRIVEN EXPIRY
|
||||
=======================
|
||||
|
||||
As an alternative, it is possible for userspace to request expiry of any
|
||||
mountpoint (though some will be rejected - the current process's idea of the
|
||||
rootfs for example). It does this by passing the MNT_EXPIRE flag to
|
||||
umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
|
||||
|
||||
If the mountpoint in question is referenced by something other than
|
||||
umount() or its parent mountpoint, an EBUSY error will be returned and the
|
||||
mountpoint will not be marked for expiration or unmounted.
|
||||
|
||||
If the mountpoint was not already marked for expiry at that time, an EAGAIN
|
||||
error will be given and it won't be unmounted.
|
||||
|
||||
Otherwise if it was already marked and it wasn't referenced, unmounting will
|
||||
take place as usual.
|
||||
|
||||
Again, the expiration flag is cleared every time anything other than umount()
|
||||
looks at a mountpoint.
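
From a C program this might look like the sketch below. The C library
exposes the flag-taking form of the call as umount2(); the wrapper
name and its return convention are made up for the example.

```c
#include <errno.h>
#include <sys/mount.h>

/*
 * Ask the kernel to expire the mount at path. Returns 1 if it was
 * unmounted, 0 if it was only marked for expiry (EAGAIN), and -1 on
 * any other error, with errno left as set by umount2().
 */
int request_expiry(const char *path)
{
	if (umount2(path, MNT_EXPIRE) == 0)
		return 1;
	if (errno == EAGAIN)
		return 0;
	return -1;
}
```

Calling this twice on an idle mountpoint, with nothing else touching it
in between, should return 0 the first time and 1 the second. A busy
mountpoint gives -1 with errno set to EBUSY.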
|
117
Documentation/filesystems/befs.txt
Normal file
@@ -0,0 +1,117 @@
|
||||
BeOS filesystem for Linux
|
||||
|
||||
Document last updated: Dec 6, 2001
|
||||
|
||||
WARNING
|
||||
=======
|
||||
Make sure you understand that this is alpha software. This means that the
|
||||
implementation is neither complete nor well-tested.
|
||||
|
||||
I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
|
||||
|
||||
LICENSE
|
||||
=======
|
||||
This software is covered by the GNU General Public License.
|
||||
See the file COPYING for the complete text of the license.
|
||||
Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
|
||||
|
||||
AUTHOR
|
||||
======
|
||||
The largest part of the code was written by Will Dyson <will_dyson@pobox.com>.
|
||||
He has been working on the code since Aug 13, 2001. See the changelog for
|
||||
details.
|
||||
|
||||
Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
|
||||
His original code can still be found at:
|
||||
<http://hp.vector.co.jp/authors/VA008030/bfs/>
|
||||
Does anyone know of a more current email address for Makoto? He doesn't
|
||||
respond to the address given above...
|
||||
|
||||
Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
|
||||
|
||||
WHAT IS THIS DRIVER?
|
||||
====================
|
||||
This module implements the native filesystem of BeOS <http://www.be.com/>
|
||||
for the linux 2.4.1 and later kernels. Currently it is a read-only
|
||||
implementation.
|
||||
|
||||
Which is it, BFS or BEFS?
|
||||
=========================
|
||||
Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
|
||||
But the UnixWare Boot Filesystem is called bfs, too, and it is already in
|
||||
the kernel. Because of this naming conflict, on Linux the BeOS
|
||||
filesystem is called befs.
|
||||
|
||||
HOW TO INSTALL
|
||||
==============
|
||||
step 1. Install the BeFS patch into the source code tree of linux.
|
||||
|
||||
Apply the patchfile to your kernel source tree.
|
||||
Assuming that your kernel source is in /foo/bar/linux and the patchfile
|
||||
is called patch-befs-xxx, you would do the following:
|
||||
|
||||
cd /foo/bar/linux
|
||||
patch -p1 < /path/to/patch-befs-xxx
|
||||
|
||||
If the patching step fails (i.e. there are rejected hunks), you can try to
|
||||
figure it out yourself (it shouldn't be hard), or mail the maintainer
|
||||
(Will Dyson <will_dyson@pobox.com>) for help.
|
||||
|
||||
step 2. Configuration & make kernel
|
||||
|
||||
The linux kernel has many compile-time options. Most of them are beyond the
|
||||
scope of this document. I suggest the Kernel-HOWTO document as a good general
|
||||
reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html>
|
||||
|
||||
However, to use the BeFS module, you must enable it at configure time.
|
||||
|
||||
cd /foo/bar/linux
|
||||
make menuconfig (or xconfig)
|
||||
|
||||
The BeFS module is not a standard part of the linux kernel, so you must first
|
||||
enable support for experimental code under the "Code maturity level" menu.
|
||||
|
||||
Then, under the "Filesystems" menu will be an option called "BeFS
|
||||
filesystem (experimental)", or something like that. Enable that option
|
||||
(it is fine to make it a module).
|
||||
|
||||
Save your kernel configuration and then build your kernel.
|
||||
|
||||
step 3. Install
|
||||
|
||||
See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
|
||||
instructions on this critical step.
|
||||
|
||||
USING BFS
|
||||
=========
|
||||
To use the BeOS filesystem, use filesystem type 'befs'.
|
||||
|
||||
ex)
|
||||
mount -t befs /dev/fd0 /beos
|
||||
|
||||
MOUNT OPTIONS
|
||||
=============
|
||||
uid=nnn All files in the partition will be owned by user id nnn.
|
||||
gid=nnn All files in the partition will be in group nnn.
|
||||
iocharset=xxx Use xxx as the name of the NLS translation table.
|
||||
debug The driver will output debugging information to the syslog.
|
||||
|
||||
HOW TO GET LATEST VERSION
|
||||
==========================
|
||||
|
||||
The latest version is currently available at:
|
||||
<http://befs-driver.sourceforge.net/>
|
||||
|
||||
ANY KNOWN BUGS?
|
||||
===============
|
||||
As of Jan 20, 2002:
|
||||
|
||||
None
|
||||
|
||||
SPECIAL THANKS
|
||||
==============
|
||||
Dominic Giampaolo ... Writing "Practical File System Design with the Be File System"
|
||||
Hiroyuki Yamada ... Testing LinuxPPC.
|
||||
|
||||
|
||||
|
57
Documentation/filesystems/bfs.txt
Normal file
@@ -0,0 +1,57 @@
|
||||
BFS FILESYSTEM FOR LINUX
|
||||
========================
|
||||
|
||||
The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
|
||||
usually contains the kernel image and a few other files required for the
|
||||
boot process.
|
||||
|
||||
In order to access /stand partition under Linux you obviously need to
|
||||
know the partition number and the kernel must support UnixWare disk slices
|
||||
(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
|
||||
depend on having UnixWare disklabel support because one can also mount
|
||||
BFS filesystem via loopback:
|
||||
|
||||
# losetup /dev/loop0 stand.img
|
||||
# mount -t bfs /dev/loop0 /mnt/stand
|
||||
|
||||
where stand.img is a file containing the image of BFS filesystem.
|
||||
When you have finished using it and umounted you need to also deallocate
|
||||
/dev/loop0 device by:
|
||||
|
||||
# losetup -d /dev/loop0
|
||||
|
||||
You can simplify mounting by just typing:
|
||||
|
||||
# mount -t bfs -o loop stand.img /mnt/stand
|
||||
|
||||
this will allocate the first available loopback device (and load loop.o
|
||||
kernel module if necessary) automatically. If the loopback driver is not
|
||||
loaded automatically, make sure that your kernel is compiled with kmod
|
||||
support (CONFIG_KMOD) enabled. Beware that umount will not
|
||||
deallocate /dev/loopN device if /etc/mtab file on your system is a
|
||||
symbolic link to /proc/mounts. You will need to do it manually using
|
||||
"-d" switch of losetup(8). Read losetup(8) manpage for more info.
|
||||
|
||||
To create the BFS image under UnixWare you need to find out first which
|
||||
slice contains it. The command prtvtoc(1M) is your friend:
|
||||
|
||||
# prtvtoc /dev/rdsk/c0b0t0d0s0
|
||||
|
||||
(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
|
||||
look for the slice with tag "STAND", which is usually slice 10. With this
|
||||
information you can use dd(1) to create the BFS image:
|
||||
|
||||
# umount /stand
|
||||
# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
|
||||
|
||||
Just in case, you can verify that you have done the right thing by checking
|
||||
the magic number:
|
||||
|
||||
# od -Ad -tx4 stand.img | more
|
||||
|
||||
The first 4 bytes should be 0x1badface.
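The same check can be scripted. The following is a minimal sketch, assuming
the magic number is stored as a little-endian 32-bit word at offset 0 (which
is how `od -tx4' displays it on a little-endian host such as x86). The
function name looks_like_bfs is made up for this example.

```python
import struct

BFS_MAGIC = 0x1BADFACE  # value quoted in the text above


def looks_like_bfs(first_bytes: bytes) -> bool:
    """Return True if the first 4 bytes decode to the BFS magic number.

    Assumption: the magic is a little-endian 32-bit word at offset 0.
    """
    if len(first_bytes) < 4:
        return False
    (magic,) = struct.unpack("<I", first_bytes[:4])
    return magic == BFS_MAGIC


# Usage sketch: pass the first bytes of the image created with dd(1), e.g.
#   with open("stand.img", "rb") as image:
#       print(looks_like_bfs(image.read(4)))
```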
|
||||
|
||||
If you have any patches, questions or suggestions regarding this BFS
|
||||
implementation please contact the author:
|
||||
|
||||
Tigran A. Aivazian <tigran@veritas.com>
|
51
Documentation/filesystems/cifs.txt
Normal file
@@ -0,0 +1,51 @@
|
||||
This is the client VFS module for the Common Internet File System
|
||||
(CIFS) protocol which is the successor to the Server Message Block
|
||||
(SMB) protocol, the native file sharing mechanism for most early
|
||||
PC operating systems. CIFS is fully supported by current network
|
||||
file servers such as Windows 2000, Windows 2003 (including
|
||||
Windows XP) as well as by Samba (which provides excellent CIFS
|
||||
server support for Linux and many other operating systems), so
|
||||
this network filesystem client can mount to a wide variety of
|
||||
servers. The smbfs module should be used instead of this cifs module
|
||||
for mounting to older SMB servers such as OS/2. The smbfs and cifs
|
||||
modules can coexist and do not conflict. The CIFS VFS filesystem
|
||||
module is designed to work well with servers that implement the
|
||||
newer versions (dialects) of the SMB/CIFS protocol such as Samba,
|
||||
the program written by Andrew Tridgell that turns any Unix host
|
||||
into a SMB/CIFS file server.
|
||||
|
||||
The intent of this module is to provide the most advanced network
|
||||
file system function for CIFS compliant servers, including better
|
||||
POSIX compliance, secure per-user session establishment, high
|
||||
performance safe distributed caching (oplock), optional packet
|
||||
signing, large files, Unicode support and other internationalization
|
||||
improvements. Since both Samba server and this filesystem client support
|
||||
the CIFS Unix extensions, the combination can provide a reasonable
|
||||
alternative to NFSv4 for fileserving in some Linux to Linux environments,
|
||||
not just in Linux to Windows environments.
|
||||
|
||||
This filesystem has an optional mount utility (mount.cifs) that can
|
||||
be obtained from the project page and installed in the path in the same
|
||||
directory with the other mount helpers (such as mount.smbfs).
|
||||
Mounting using the cifs filesystem without installing the mount helper
|
||||
requires specifying the server's ip address.
|
||||
|
||||
For Linux 2.4:
|
||||
mount //anything/here /mnt_target -o
|
||||
user=username,pass=password,unc=//ip_address_of_server/sharename
|
||||
|
||||
For Linux 2.5:
|
||||
mount //ip_address_of_server/sharename /mnt_target -o user=username,pass=password
|
||||
|
||||
|
||||
For more information on the module see the project page at
|
||||
|
||||
http://us1.samba.org/samba/Linux_CIFS_client.html
|
||||
|
||||
For more information on CIFS see:
|
||||
|
||||
http://www.snia.org/tech_activities/CIFS
|
||||
|
||||
or the Samba site:
|
||||
|
||||
http://www.samba.org
|
1673
Documentation/filesystems/coda.txt
Normal file
File diff suppressed because it is too large
76
Documentation/filesystems/cramfs.txt
Normal file
@@ -0,0 +1,76 @@
|
||||
|
||||
Cramfs - cram a filesystem onto a small ROM
|
||||
|
||||
cramfs is designed to be simple and small, and to compress things well.
|
||||
|
||||
It uses the zlib routines to compress a file one page at a time, and
|
||||
allows random page access. The meta-data is not compressed, but is
|
||||
expressed in a very terse representation to make it use much less
|
||||
diskspace than traditional filesystems.
|
||||
|
||||
You can't write to a cramfs filesystem (making it compressible and
|
||||
compact also makes it _very_ hard to update on-the-fly), so you have to
|
||||
create the disk image with the "mkcramfs" utility.
|
||||
|
||||
|
||||
Usage Notes
|
||||
-----------
|
||||
|
||||
File sizes are limited to less than 16MB.
|
||||
|
||||
Maximum filesystem size is a little over 256MB. (The last file on the
|
||||
filesystem is allowed to extend past 256MB.)
|
||||
|
||||
Only the low 8 bits of gid are stored. The current version of
|
||||
mkcramfs simply truncates to 8 bits, which is a potential security
|
||||
issue.
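To see why the truncation matters, here is a small sketch of what "keeping
the low 8 bits" does to a group ID. The helper name stored_gid is
hypothetical and is not taken from mkcramfs.

```python
def stored_gid(gid: int) -> int:
    """Keep only the low 8 bits of a group ID, as described above."""
    return gid & 0xFF


# A file owned by group 256 ends up with gid 0 in the image, and
# group 1000 ends up as group 232, which may belong to someone else.
examples = {gid: stored_gid(gid) for gid in (5, 256, 1000)}
```

Any gid of 256 or above maps onto a different, lower-numbered group, so
files may become accessible to a group that was never meant to see them.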
|
||||
|
||||
Hard links are supported, but hard linked files
|
||||
will still have a link count of 1 in the cramfs image.
|
||||
|
||||
Cramfs directories have no `.' or `..' entries. Directories (like
|
||||
every other file on cramfs) always have a link count of 1. (There's
|
||||
no need to use -noleaf in `find', btw.)
|
||||
|
||||
No timestamps are stored in a cramfs, so these default to the epoch
|
||||
(1970 GMT). Recently-accessed files may have updated timestamps, but
|
||||
the update lasts only as long as the inode is cached in memory, after
|
||||
which the timestamp reverts to 1970, i.e. moves backwards in time.
|
||||
|
||||
Currently, cramfs must be written and read with architectures of the
|
||||
same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
|
||||
== 4096. At least the latter of these is a bug, but it hasn't been
|
||||
decided what the best fix is. For the moment if you have larger pages
|
||||
you can just change the #define in mkcramfs.c, so long as you don't
|
||||
mind the filesystem becoming unreadable to future kernels.
|
||||
|
||||
|
||||
For /usr/share/magic
|
||||
--------------------
|
||||
|
||||
0       ulelong 0x28cd3d45      Linux cramfs offset 0
>4      ulelong x               size %d
>8      ulelong x               flags 0x%x
>12     ulelong x               future 0x%x
>16     string  >\0             signature "%.16s"
>32     ulelong x               fsid.crc 0x%x
>36     ulelong x               fsid.edition %d
>40     ulelong x               fsid.blocks %d
>44     ulelong x               fsid.files %d
>48     string  >\0             name "%.16s"
512     ulelong 0x28cd3d45      Linux cramfs offset 512
>516    ulelong x               size %d
>520    ulelong x               flags 0x%x
>524    ulelong x               future 0x%x
>528    string  >\0             signature "%.16s"
>544    ulelong x               fsid.crc 0x%x
>548    ulelong x               fsid.edition %d
>552    ulelong x               fsid.blocks %d
>556    ulelong x               fsid.files %d
>560    string  >\0             name "%.16s"
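The magic entries above describe a 64-byte little-endian header that may sit
at offset 0 or at offset 512. As an illustration only, the sketch below reads
the same fields from an image held in memory. The function name
find_cramfs_header and the dictionary keys are assumptions made for this
example; see fs/cramfs/README for the authoritative layout.

```python
import struct
from typing import Optional

CRAMFS_MAGIC = 0x28CD3D45
# magic, size, flags, future, signature[16], crc, edition, blocks, files, name[16]
_HEADER = struct.Struct("<4I16s4I16s")


def find_cramfs_header(image: bytes) -> Optional[dict]:
    """Look for the cramfs magic at offsets 0 and 512 and decode the header."""
    for offset in (0, 512):
        chunk = image[offset:offset + _HEADER.size]
        if len(chunk) < _HEADER.size:
            continue
        (magic, size, flags, future, signature,
         crc, edition, blocks, files, name) = _HEADER.unpack(chunk)
        if magic != CRAMFS_MAGIC:
            continue
        return {
            "offset": offset,
            "size": size,
            "flags": flags,
            "future": future,
            "signature": signature.rstrip(b"\0").decode("ascii", "replace"),
            "fsid.crc": crc,
            "fsid.edition": edition,
            "fsid.blocks": blocks,
            "fsid.files": files,
            "name": name.rstrip(b"\0").decode("ascii", "replace"),
        }
    return None
```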
|
||||
|
||||
|
||||
Hacker Notes
|
||||
------------
|
||||
|
||||
See fs/cramfs/README for filesystem layout and implementation notes.
|
1977
Documentation/filesystems/devfs/ChangeLog
Normal file
File diff suppressed because it is too large
1964
Documentation/filesystems/devfs/README
Normal file
File diff suppressed because it is too large
40
Documentation/filesystems/devfs/ToDo
Normal file
@@ -0,0 +1,40 @@
|
||||
Device File System (devfs) ToDo List
|
||||
|
||||
Richard Gooch <rgooch@atnf.csiro.au>
|
||||
|
||||
3-JUL-2000
|
||||
|
||||
This is a list of things to be done for better devfs support in the
|
||||
Linux kernel. If you'd like to contribute to the devfs, please have a
|
||||
look at this list for anything that is unallocated. Also, if there are
|
||||
items missing (surely), please contact me so I can add them to the
|
||||
list (preferably with your name attached to them:-).
|
||||
|
||||
|
||||
- >256 ptys
|
||||
Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
|
||||
|
||||
- Amiga floppy driver (drivers/block/amiflop.c)
|
||||
|
||||
- Atari floppy driver (drivers/block/ataflop.c)
|
||||
|
||||
- SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c)
|
||||
|
||||
- Amiga ZorroII ramdisc driver (drivers/block/z2ram.c)
|
||||
|
||||
- Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c)
|
||||
|
||||
- Parallel port ATAPI floppy (drivers/block/paride/pf.c)
|
||||
|
||||
- AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c)
|
||||
|
||||
- Archimedes floppy (drivers/acorn/block/fd1772.c)
|
||||
|
||||
- MFM hard drive (drivers/acorn/block/mfmhd.c)
|
||||
|
||||
- I2O block device (drivers/message/i2o/i2o_block.c)
|
||||
|
||||
- ST-RAM device (arch/m68k/atari/stram.c)
|
||||
|
||||
- Raw devices
|
||||
|
65
Documentation/filesystems/devfs/boot-options
Normal file
@@ -0,0 +1,65 @@
|
||||
/* -*- auto-fill -*- */
|
||||
|
||||
Device File System (devfs) Boot Options
|
||||
|
||||
Richard Gooch <rgooch@atnf.csiro.au>
|
||||
|
||||
18-AUG-2001
|
||||
|
||||
|
||||
When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options
|
||||
to the kernel to debug devfs. The boot options are prefixed by
|
||||
"devfs=", and are separated by commas. Spaces are not allowed. The
|
||||
syntax looks like this:
|
||||
|
||||
devfs=<option1>,<option2>,<option3>
|
||||
|
||||
and so on. For example, if you wanted to turn on debugging for module
|
||||
load requests and device registration, you would do:
|
||||
|
||||
devfs=dmod,dreg
|
||||
|
||||
You may prefix "no" to any option. This will invert the option.
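The syntax described above (a "devfs=" prefix, comma-separated options, no
spaces, and an optional "no" prefix that inverts an option) can be sketched
as a small parser. This is only an illustration of the syntax, not the
kernel's own parsing code, and the name parse_devfs_options is hypothetical.

```python
def parse_devfs_options(arg: str) -> dict:
    """Turn 'devfs=opt1,noopt2' into {'opt1': True, 'opt2': False}."""
    prefix = "devfs="
    if not arg.startswith(prefix):
        raise ValueError("boot option must start with 'devfs='")
    if " " in arg:
        raise ValueError("spaces are not allowed")
    options = {}
    for item in arg[len(prefix):].split(","):
        if not item:
            continue  # tolerate empty fields such as a trailing comma
        if item.startswith("no"):
            options[item[2:]] = False  # "no" prefix inverts the option
        else:
            options[item] = True
    return options
```

For example, parse_devfs_options("devfs=dmod,dreg") yields both options set
to True, while "devfs=nodmod" yields dmod set to False.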
|
||||
|
||||
|
||||
Debugging Options
|
||||
=================
|
||||
|
||||
These require CONFIG_DEVFS_DEBUG to be enabled.
|
||||
Note that all debugging options have 'd' as the first character. By
|
||||
default all options are off. All debugging output is sent to the
|
||||
kernel logs. The debugging options do not take effect until the devfs
|
||||
version message appears (just prior to the root filesystem being
|
||||
mounted).
|
||||
|
||||
These are the options:
|
||||
|
||||
dmod print module load requests to <request_module>
|
||||
|
||||
dreg print device register requests to <devfs_register>
|
||||
|
||||
dunreg print device unregister requests to <devfs_unregister>
|
||||
|
||||
dchange print device change requests to <devfs_set_flags>
|
||||
|
||||
dilookup print inode lookup requests
|
||||
|
||||
diget print VFS inode allocations
|
||||
|
||||
diunlink print inode unlinks
|
||||
|
||||
dichange print inode changes
|
||||
|
||||
dimknod print calls to mknod(2)
|
||||
|
||||
dall some debugging turned on
|
||||
|
||||
|
||||
Other Options
|
||||
=============
|
||||
|
||||
These control the default behaviour of devfs. The options are:
|
||||
|
||||
mount mount devfs onto /dev at boot time
|
||||
|
||||
only disable non-devfs device nodes for devfs-capable drivers
|
113
Documentation/filesystems/directory-locking
Normal file
@@ -0,0 +1,113 @@
|
||||
Locking scheme used for directory operations is based on two
|
||||
kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
|
||||
|
||||
For our purposes all operations fall in 6 classes:
|
||||
|
||||
1) read access. Locking rules: caller locks directory we are accessing.
|
||||
|
||||
2) object creation. Locking rules: same as above.
|
||||
|
||||
3) object removal. Locking rules: caller locks parent, finds victim,
|
||||
locks victim and calls the method.
|
||||
|
||||
4) rename() that is _not_ cross-directory. Locking rules: caller locks
|
||||
the parent, finds source and target, if target already exists - locks it
|
||||
and then calls the method.
|
||||
|
||||
5) link creation. Locking rules:
|
||||
* lock parent
|
||||
* check that source is not a directory
|
||||
* lock source
|
||||
* call the method.
|
||||
|
||||
6) cross-directory rename. The trickiest in the whole bunch. Locking
|
||||
rules:
|
||||
* lock the filesystem
|
||||
* lock parents in "ancestors first" order.
|
||||
* find source and target.
|
||||
* if old parent is equal to or is a descendent of target
|
||||
fail with -ENOTEMPTY
|
||||
* if new parent is equal to or is a descendent of source
|
||||
fail with -ELOOP
|
||||
* if target exists - lock it.
|
||||
* call the method.
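The ancestor checks and the "ancestors first" ordering used by
cross-directory rename can be illustrated with a toy model in which every
object has a single parent pointer (i.e. the directory graph is a tree).
This is a sketch under that assumption, not the kernel code: the Node class
and the function names other than is_subdir are invented for the example,
and real locking is left out.

```python
import errno


class Node:
    """Toy filesystem object with a single parent pointer."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def is_subdir(node, ancestor) -> bool:
    """True if node is ancestor itself or one of its descendants."""
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False


def parent_lock_order(old_parent, new_parent):
    """Return the two parents in 'ancestors first' order.

    If neither is an ancestor of the other, any fixed order will do;
    this sketch simply keeps (old_parent, new_parent).
    """
    if is_subdir(old_parent, new_parent):
        return (new_parent, old_parent)
    return (old_parent, new_parent)


def check_cross_rename(old_parent, source, new_parent, target) -> int:
    """Apply the two ancestry checks; return 0 or a negative errno."""
    if target is not None and is_subdir(old_parent, target):
        return -errno.ENOTEMPTY
    if is_subdir(new_parent, source):
        return -errno.ELOOP
    return 0
```

With a tree root/a/b, moving a into b fails with -ELOOP, because the new
parent b is a descendant of the source a.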
|
||||
|
||||
|
||||
The rules above obviously guarantee that all directories that are going to be
|
||||
read, modified or removed by method will be locked by caller.
|
||||
|
||||
|
||||
If no directory is its own ancestor, the scheme above is deadlock-free.
|
||||
Proof:
|
||||
|
||||
First of all, at any moment we have a partial ordering of the
|
||||
objects - A < B iff A is an ancestor of B.
|
||||
|
||||
That ordering can change. However, the following is true:
|
||||
|
||||
(1) if object removal or non-cross-directory rename holds lock on A and
|
||||
attempts to acquire lock on B, A will remain the parent of B until we
|
||||
acquire the lock on B. (Proof: only cross-directory rename can change
|
||||
the parent of object and it would have to lock the parent).
|
||||
|
||||
(2) if cross-directory rename holds the lock on filesystem, order will not
|
||||
change until rename acquires all locks. (Proof: other cross-directory
|
||||
renames will be blocked on filesystem lock and we don't start changing
|
||||
the order until we had acquired all locks).
|
||||
|
||||
(3) any operation holds at most one lock on non-directory object and
|
||||
that lock is acquired after all other locks. (Proof: see descriptions
|
||||
of operations).
|
||||
|
||||
Now consider the minimal deadlock. Each process is blocked on
|
||||
attempt to acquire some lock and already holds at least one lock. Let's
|
||||
consider the set of contended locks. First of all, filesystem lock is
|
||||
not contended, since any process blocked on it is not holding any locks.
|
||||
Thus all processes are blocked on ->i_sem.
|
||||
|
||||
Non-directory objects are not contended due to (3). Thus link
|
||||
creation can't be a part of deadlock - it can't be blocked on source
|
||||
and it means that it doesn't hold any locks.
|
||||
|
||||
Any contended object is either held by cross-directory rename or
|
||||
has a child that is also contended. Indeed, suppose that it is held by
|
||||
operation other than cross-directory rename. Then the lock this operation
|
||||
is blocked on belongs to child of that object due to (1).
|
||||
|
||||
It means that one of the operations is cross-directory rename.
|
||||
Otherwise the set of contended objects would be infinite - each of them
|
||||
would have a contended child and we had assumed that no object is its
|
||||
own descendent. Moreover, there is exactly one cross-directory rename
|
||||
(see above).
|
||||
|
||||
Consider the object blocking the cross-directory rename. One
|
||||
of its descendents is locked by cross-directory rename (otherwise we
|
||||
would again have an infinite set of contended objects). But that
|
||||
means that cross-directory rename is taking locks out of order. Due
|
||||
to (2) the order hadn't changed since we had acquired filesystem lock.
|
||||
But locking rules for cross-directory rename guarantee that we do not
|
||||
try to acquire lock on descendent before the lock on ancestor.
|
||||
Contradiction. I.e. deadlock is impossible. Q.E.D.
|
||||
|
||||
|
||||
These operations are guaranteed to avoid loop creation. Indeed,
|
||||
the only operation that could introduce loops is cross-directory rename.
|
||||
Since the only new (parent, child) pair added by rename() is (new parent,
|
||||
source), such loop would have to contain these objects and the rest of it
|
||||
would have to exist before rename(). I.e. at the moment of loop creation
|
||||
rename() responsible for that would be holding filesystem lock and new parent
|
||||
would have to be equal to or a descendent of source. But that means that
|
||||
new parent had been equal to or a descendent of source since the moment when
|
||||
we had acquired filesystem lock and rename() would fail with -ELOOP in that
|
||||
case.
|
||||
|
||||
While this locking scheme works for arbitrary DAGs, it relies on
|
||||
ability to check that directory is a descendent of another object. Current
|
||||
implementation assumes that directory graph is a tree. This assumption is
|
||||
also preserved by all operations (cross-directory rename on a tree that would
|
||||
not introduce a cycle will leave it a tree and link() fails for directories).
|
||||
|
||||
Notice that "directory" in the above == "anything that might have
|
||||
children", so if we are going to introduce hybrid objects we will need
|
||||
either to make sure that link(2) doesn't work for them or to make changes
|
||||
in is_subdir() that would make it work even in presence of such beasts.
|
383
Documentation/filesystems/ext2.txt
Normal file
@@ -0,0 +1,383 @@
|
||||
|
||||
The Second Extended Filesystem
|
||||
==============================
|
||||
|
||||
ext2 was originally released in January 1993. Written by Rémy Card,
|
||||
Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
|
||||
Extended Filesystem. It is currently still (April 2001) the predominant
|
||||
filesystem in use by Linux. There are also implementations available
|
||||
for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
|
||||
|
||||
Options
|
||||
=======
|
||||
|
||||
Most defaults are determined by the filesystem superblock, and can be
|
||||
set using tune2fs(8). Kernel-determined defaults are indicated by (*).
|
||||
|
||||
bsddf                   (*)     Makes `df' act like BSD.
minixdf                         Makes `df' act like Minix.

check                           Check block and inode bitmaps at mount time
                                (requires CONFIG_EXT2_CHECK).
check=none, nocheck     (*)     Don't do extra checking of bitmaps on mount
                                (check=normal and check=strict options removed)

debug                           Extra debugging information is sent to the
                                kernel syslog. Useful for developers.

errors=continue                 Keep going on a filesystem error.
errors=remount-ro               Remount the filesystem read-only on an error.
errors=panic                    Panic and halt the machine if an error occurs.

grpid, bsdgroups                Give objects the same group ID as their parent.
nogrpid, sysvgroups             New objects have the group ID of their creator.

nouid32                         Use 16-bit UIDs and GIDs.

oldalloc                        Enable the old block allocator. Orlov should
                                have better performance, we'd like to get some
                                feedback if it's the contrary for you.
orlov                   (*)     Use the Orlov block allocator.
                                (See http://lwn.net/Articles/14633/ and
                                http://lwn.net/Articles/14446/.)

resuid=n                        The user ID which may use the reserved blocks.
resgid=n                        The group ID which may use the reserved blocks.

sb=n                            Use alternate superblock at this location.

user_xattr                      Enable "user." POSIX Extended Attributes
                                (requires CONFIG_EXT2_FS_XATTR).
                                See also http://acl.bestbits.at
nouser_xattr                    Don't support "user." extended attributes.

acl                             Enable POSIX Access Control Lists support
                                (requires CONFIG_EXT2_FS_POSIX_ACL).
                                See also http://acl.bestbits.at
noacl                           Don't support POSIX ACLs.

nobh                            Do not attach buffer_heads to file pagecache.

grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
|
||||
ext2 shares many properties with traditional Unix filesystems. It has
|
||||
the concepts of blocks, inodes and directories. It has space in the
|
||||
specification for Access Control Lists (ACLs), fragments, undeletion and
|
||||
compression though these are not yet implemented (some are available as
|
||||
separate patches). There is also a versioning mechanism to allow new
|
||||
features (such as journalling) to be added in a maximally compatible
|
||||
manner.
|
||||
|
||||
Blocks
|
||||
------
|
||||
|
||||
The space in the device or file is split up into blocks. These are
|
||||
a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
|
||||
which is decided when the filesystem is created. Smaller blocks mean
|
||||
less wasted space per file, but require slightly more accounting overhead,
|
||||
and also impose other limits on the size of files and the filesystem.
|
||||
|
||||
Block Groups
|
||||
------------
|
||||
|
||||
Blocks are clustered into block groups in order to reduce fragmentation
|
||||
and minimise the amount of head seeking when reading a large amount
|
||||
of consecutive data. Information about each block group is kept in a
|
||||
descriptor table stored in the block(s) immediately after the superblock.
|
||||
Two blocks near the start of each group are reserved for the block usage
|
||||
bitmap and the inode usage bitmap which show which blocks and inodes
|
||||
are in use. Since each bitmap is limited to a single block, this means
|
||||
that the maximum number of blocks in a block group is 8 times the block size in bytes.
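To make the limit concrete: a bitmap that fits in one block has 8 bits per
byte, so it can track 8 * block_size blocks. The short sketch below works
this out for the block sizes mentioned earlier; the function name
block_group_limits is made up for the example.

```python
def block_group_limits(block_size: int) -> tuple:
    """Return (max blocks per group, max bytes per group) for a block size."""
    blocks_per_group = 8 * block_size  # one bitmap bit per block
    return blocks_per_group, blocks_per_group * block_size


# 1024-byte blocks: 8192 blocks per group, i.e. 8 MiB
# 4096-byte blocks: 32768 blocks per group, i.e. 128 MiB
limits = {size: block_group_limits(size) for size in (1024, 2048, 4096)}
```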
|
||||
|
||||
The block(s) following the bitmaps in each block group are designated
|
||||
as the inode table for that block group and the remainder are the data
|
||||
blocks. The block allocation algorithm attempts to allocate data blocks
|
||||
in the same block group as the inode which contains them.
|
||||
|
||||
The Superblock
|
||||
--------------
|
||||
|
||||
The superblock contains all the information about the configuration of
|
||||
the filing system. The primary copy of the superblock is stored at an
|
||||
offset of 1024 bytes from the start of the device, and it is essential
|
||||
to mounting the filesystem. Since it is so important, backup copies of
|
||||
the superblock are stored in block groups throughout the filesystem.
|
||||
The first version of ext2 (revision 0) stores a copy at the start of
|
||||
every block group, along with backups of the group descriptor block(s).
|
||||
Because this can consume a considerable amount of space for large
|
||||
filesystems, later revisions can optionally reduce the number of backup
|
||||
copies by only putting backups in specific groups (this is the sparse
|
||||
superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
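The rule "0, 1 and powers of 3, 5 and 7" is easy to enumerate. The following
sketch lists the block groups that would carry a superblock copy for a
filesystem with a given number of groups; backup_superblock_groups is a
hypothetical helper name, not a function from the kernel or e2fsprogs.

```python
def backup_superblock_groups(group_count: int) -> list:
    """List groups 0, 1 and the powers of 3, 5 and 7 below group_count."""
    groups = {g for g in (0, 1) if g < group_count}
    for base in (3, 5, 7):
        power = base
        while power < group_count:
            groups.add(power)
            power *= base
    return sorted(groups)
```

For a filesystem with 100 block groups this gives groups 0, 1, 3, 5, 7, 9,
25, 27, 49 and 81, i.e. ten copies instead of one hundred.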
|
||||
|
||||
The information in the superblock contains fields such as the total
|
||||
number of inodes and blocks in the filesystem and how many are free,
|
||||
how many inodes and blocks are in each block group, when the filesystem
|
||||
was mounted (and if it was cleanly unmounted), when it was modified,
|
||||
what version of the filesystem it is (see the Revisions section below)
|
||||
and which OS created it.
|
||||
|
||||
If the filesystem is revision 1 or higher, then there are extra fields,
|
||||
such as a volume name, a unique identification number, the inode size,
|
||||
and space for optional filesystem features to store configuration info.
|
||||
|
||||
All fields in the superblock (as in all other ext2 structures) are stored
|
||||
on the disc in little endian format, so a filesystem is portable between
|
||||
machines without having to know what machine it was created on.
|
||||
|
||||
Inodes
|
||||
------
|
||||
|
||||
The inode (index node) is a fundamental concept in the ext2 filesystem.
|
||||
Each object in the filesystem is represented by an inode. The inode
|
||||
structure contains pointers to the filesystem blocks which contain the
|
||||
data held in the object and all of the metadata about an object except
|
||||
its name. The metadata about an object includes the permissions, owner,
|
||||
group, flags, size, number of blocks used, access time, change time,
|
||||
modification time, deletion time, number of links, fragments, version
|
||||
(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
|
||||
|
||||
There are some reserved fields which are currently unused in the inode
|
||||
structure and several which are overloaded. One field is reserved for the
|
||||
directory ACL if the inode is a directory and alternately for the top 32
|
||||
bits of the file size if the inode is a regular file (allowing file sizes
|
||||
larger than 2GB). The translator field is unused under Linux, but is used
|
||||
by the HURD to reference the inode of a program which will be used to
|
||||
interpret this object. Most of the remaining reserved fields have been
|
||||
used up for both Linux and the HURD for larger owner and group fields.
|
||||
The HURD also has a larger mode field so it uses another of the remaining
|
||||
fields to store the extra mode bits.
|
||||
|
||||
There are pointers to the first 12 blocks which contain the file's data
|
||||
in the inode. There is a pointer to an indirect block (which contains
|
||||
pointers to the next set of blocks), a pointer to a doubly-indirect
|
||||
block (which contains pointers to indirect blocks) and a pointer to a
|
||||
trebly-indirect block (which contains pointers to doubly-indirect blocks).
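The pointer scheme above bounds the number of data blocks a single inode can
address. The sketch below does the arithmetic, assuming each block pointer
occupies 4 bytes (an assumption of this example). Other limits, such as the
width of the size and block-count fields, may cap the usable file size
below this figure.

```python
def max_file_blocks(block_size: int, pointer_size: int = 4) -> int:
    """Blocks addressable via 12 direct pointers plus the indirect chains."""
    per_block = block_size // pointer_size  # pointers held by one block
    return 12 + per_block + per_block ** 2 + per_block ** 3


# With 1024-byte blocks each indirect block holds 256 pointers, giving
# 12 + 256 + 256**2 + 256**3 addressable blocks (roughly 16 GiB of data).
blocks_1k = max_file_blocks(1024)
```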
|
||||
|
||||
The flags field contains some ext2-specific flags which aren't catered
|
||||
for by the standard chmod flags. These flags can be listed with lsattr
|
||||
and changed with the chattr command, and allow specific filesystem
|
||||
behaviour on a per-file basis. There are flags for secure deletion,
|
||||
undeletable, compression, synchronous updates, immutability, append-only,
|
||||
dumpable, no-atime, indexed directories, and data-journaling. Not all
|
||||
of these are supported yet.
|
||||
|
||||
Directories
|
||||
-----------
|
||||
|
||||
A directory is a filesystem object and has an inode just like a file.
|
||||
It is a specially formatted file containing records which associate
|
||||
each name with an inode number. Later revisions of the filesystem also
|
||||
encode the type of the object (file, directory, symlink, device, fifo,
|
||||
socket) to avoid the need to check the inode itself for this information
|
||||
(support for taking advantage of this feature does not yet exist in
|
||||
Glibc 2.2).
|
||||
|
||||
The inode allocation code tries to assign inodes which are in the same
|
||||
block group as the directory in which they are first created.
|
||||
|
||||
The current implementation of ext2 uses a singly-linked list to store
|
||||
the filenames in the directory; a pending enhancement uses hashing of the
|
||||
filenames to allow lookup without the need to scan the entire directory.
|
||||
|
||||
The current implementation never removes empty directory blocks once they
|
||||
have been allocated to hold more files.
|
||||
|
||||
Special files
|
||||
-------------
|
||||
|
||||
Symbolic links are also filesystem objects with inodes. They deserve
|
||||
special mention because the data for them is stored within the inode
|
||||
itself if the symlink is less than 60 bytes long. It uses the fields
|
||||
which would normally be used to store the pointers to data blocks.
|
||||
This is a worthwhile optimisation as we avoid allocating a full
|
||||
block for the symlink, and most symlinks are less than 60 characters long.
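The 60-byte figure follows from the size of the block pointer area: 12
direct pointers plus the three indirect pointers, at an assumed 4 bytes
each. A minimal sketch of the resulting test (the names here are
illustrative, not kernel identifiers):

```python
POINTER_SLOTS = 12 + 3   # direct + indirect, doubly- and trebly-indirect
POINTER_SIZE = 4         # bytes per block pointer (assumption)
INLINE_AREA = POINTER_SLOTS * POINTER_SIZE  # 60 bytes


def fits_in_inode(target: bytes) -> bool:
    """True if a symlink target is short enough to live in the inode."""
    return len(target) < INLINE_AREA
```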
|
||||
|
||||
Character and block special devices never have data blocks assigned to
|
||||
them. Instead, their device number is stored in the inode, again reusing
|
||||
the fields which would be used to point to the data blocks.
|
||||
|
||||
Reserved Space
|
||||
--------------
|
||||
|
||||
In ext2, there is a mechanism for reserving a certain number of blocks
|
||||
for a particular user (normally the super-user). This is intended to
|
||||
allow for the system to continue functioning even if non-privileged users
|
||||
fill up all the space available to them (this is independent of filesystem
|
||||
quotas). It also keeps the filesystem from filling up entirely which
|
||||
helps combat fragmentation.
|
||||
|
||||
Filesystem check
|
||||
----------------
|
||||
|
||||
At boot time, most systems run a consistency check (e2fsck) on their
|
||||
filesystems. The superblock of the ext2 filesystem contains several
|
||||
fields which indicate whether fsck should actually run (since checking
|
||||
the filesystem at boot can take a long time if it is large). fsck will
|
||||
run if the filesystem was not cleanly unmounted, if the maximum mount
|
||||
count has been exceeded or if the maximum time between checks has been
|
||||
exceeded.
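The three conditions can be written as a simple predicate. This is a sketch
of the rule as stated here, not of e2fsck itself; the parameter names are
invented for the example and do not correspond to actual superblock field
names.

```python
def needs_fsck(cleanly_unmounted: bool,
               mount_count: int, max_mount_count: int,
               seconds_since_check: int, check_interval: int) -> bool:
    """Return True if any of the three conditions described above holds."""
    if not cleanly_unmounted:
        return True
    if mount_count > max_mount_count:
        return True
    if seconds_since_check > check_interval:
        return True
    return False
```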
|
||||
|
||||
Feature Compatibility
|
||||
---------------------
|
||||
|
||||
The compatibility feature mechanism used in ext2 is sophisticated.
|
||||
It safely allows features to be added to the filesystem, without
|
||||
unnecessarily sacrificing compatibility with older versions of the
|
||||
filesystem code. The feature compatibility mechanism is not supported by
|
||||
the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
|
||||
revision 1. There are three 32-bit fields, one for compatible features
|
||||
(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
|
||||
incompatible (INCOMPAT) features.
|
||||
|
||||
These feature flags have specific meanings for the kernel as follows:
|
||||
|
||||
A COMPAT flag indicates that a feature is present in the filesystem,
|
||||
but the on-disk format is 100% compatible with older on-disk formats, so
|
||||
a kernel which didn't know anything about this feature could read/write
|
||||
the filesystem without any chance of corrupting the filesystem (or even
|
||||
making it inconsistent). This is essentially just a flag which says
|
||||
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
|
||||
want to be aware of (more on e2fsck and feature flags later). The ext3
|
||||
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
|
||||
a regular file with data blocks in it so the kernel does not need to
|
||||
take any special notice of it if it doesn't understand ext3 journaling.
|
||||
|
||||
An RO_COMPAT flag indicates that the on-disk format is 100% compatible
|
||||
with older on-disk formats for reading (i.e. the feature does not change
|
||||
the visible on-disk format). However, an old kernel writing to such a
|
||||
filesystem would/could corrupt the filesystem, so this is prevented. The
|
||||
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
|
||||
sparse groups allow file data blocks where superblock/group descriptor
|
||||
backups used to live, and ext2_free_blocks() refuses to free these blocks,
|
||||
which would lead to inconsistent bitmaps. An old kernel would also
|
||||
get an error if it tried to free a series of blocks which crossed a group
|
||||
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
|
||||
|
||||
An INCOMPAT flag indicates the on-disk format has changed in some
|
||||
way that makes it unreadable by older kernels, or would otherwise
|
||||
cause a problem if an old kernel tried to mount it. FILETYPE is an
|
||||
INCOMPAT flag because older kernels would think a filename was longer
|
||||
than 256 characters, which would lead to corrupt directory listings.
|
||||
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
|
||||
doesn't understand compression, you would just get garbage back from
|
||||
read() instead of it automatically decompressing your data. The ext3
|
||||
RECOVER flag is needed to prevent a kernel which does not understand the
|
||||
ext3 journal from mounting the filesystem without replaying the journal.
|
||||
|
||||
e2fsck needs to be more strict with the handling of these
|
||||
flags than the kernel. If it doesn't understand ANY of the COMPAT,
|
||||
RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
|
||||
because it has no way of verifying whether a given feature is valid
|
||||
or not. Allowing e2fsck to succeed on a filesystem with an unknown
|
||||
feature gives the user a false sense of security. Refusing to check
|
||||
a filesystem with unknown features is a good incentive for the user to
|
||||
update to the latest e2fsck. This also means that anyone adding feature
|
||||
flags to ext2 also needs to update e2fsck to verify these features.
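
The rules above can be sketched in C roughly as follows. This is only an
illustration of the decision logic described in this section, not code taken
from the kernel or from e2fsck: the struct, the enum and both function names
are hypothetical, and the caller is assumed to supply the set of features it
understands.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the three 32-bit feature fields. */
struct feature_sets {
	uint32_t compat;
	uint32_t ro_compat;
	uint32_t incompat;
};

enum mount_decision { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSE };

/*
 * Kernel-side rule: unknown COMPAT bits are ignored, unknown RO_COMPAT
 * bits allow read-only mounts only, unknown INCOMPAT bits refuse the mount.
 */
enum mount_decision kernel_mount_decision(const struct feature_sets *fs,
					  const struct feature_sets *supported)
{
	if (fs->incompat & ~supported->incompat)
		return MOUNT_REFUSE;
	if (fs->ro_compat & ~supported->ro_compat)
		return MOUNT_RO_ONLY;
	return MOUNT_RW;
}

/* e2fsck-side rule: an unknown bit in ANY of the three fields is fatal. */
bool fsck_can_check(const struct feature_sets *fs,
		    const struct feature_sets *supported)
{
	return !((fs->compat & ~supported->compat) ||
		 (fs->ro_compat & ~supported->ro_compat) ||
		 (fs->incompat & ~supported->incompat));
}
```

Note that a filesystem with an unknown COMPAT bit is mountable read/write by
the kernel in this sketch, yet is rejected by the fsck-side check.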
|
||||
|
||||
Metadata
|
||||
--------
|
||||
|
||||
It is frequently claimed that the ext2 implementation of writing
|
||||
asynchronous metadata is faster than the ffs synchronous metadata
|
||||
scheme but less reliable. Both methods are equally resolvable by their
|
||||
respective fsck programs.
|
||||
|
||||
If you're exceptionally paranoid, there are 3 ways of making metadata
|
||||
writes synchronous on ext2:
|
||||
|
||||
per-file if you have the program source: use the O_SYNC flag to open()
|
||||
per-file if you don't have the source: use "chattr +S" on the file
|
||||
per-filesystem: add the "sync" option to mount (or in /etc/fstab)
|
||||
|
||||
The first and last are not ext2 specific but do force the metadata to
|
||||
be written synchronously. See also Journaling below.
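
For the first method, a minimal C sketch of opening a file with O_SYNC might
look like this. The helper name append_sync() is made up for the example; only
open(), write() and close() are real interfaces.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Append a string to a file opened with O_SYNC, so that each write()
 * returns only after the data has been written synchronously.
 * Returns 0 on success, -1 on error.
 */
int append_sync(const char *path, const char *text)
{
	size_t len = strlen(text);
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, text, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

The other two methods need no source changes: "chattr +S" is applied to the
file, and "sync" is given as a mount option.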
|
||||
|
||||
Limitations
|
||||
-----------
|
||||
|
||||
There are various limits imposed by the on-disk layout of ext2. Other
|
||||
limits are imposed by the current implementation of the kernel code.
|
||||
Many of the limits are determined at the time the filesystem is first
|
||||
created, and depend upon the block size chosen. The ratio of inodes to
|
||||
data blocks is fixed at filesystem creation time, so the only way to
|
||||
increase the number of inodes is to increase the size of the filesystem.
|
||||
No tools currently exist which can change the ratio of inodes to blocks.
|
||||
|
||||
Most of these limits could be overcome with slight changes in the on-disk
|
||||
format and using a compatibility flag to signal the format change (at
|
||||
the expense of some compatibility).
|
||||
|
||||
Filesystem block size: 1kB 2kB 4kB 8kB
|
||||
|
||||
File size limit: 16GB 256GB 2048GB 2048GB
|
||||
Filesystem size limit: 2047GB 8192GB 16384GB 32768GB
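
The "File size limit" row is consistent with the following arithmetic. This
is a sketch under two assumptions that are not spelled out in this document:
that a file is mapped by 12 direct blocks plus single, double and triple
indirect blocks of 4-byte block pointers, and that the size is additionally
capped by a 32-bit count of 512-byte sectors. The function names are
hypothetical.

```c
#include <stdint.h>

/* Upper bound on file size in bytes for a given block size (assumptions above). */
uint64_t ext2_max_file_bytes(uint64_t block_size)
{
	uint64_t ptrs = block_size / 4;	/* block pointers per indirect block */
	uint64_t blocks = 12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs;
	uint64_t by_mapping = blocks * block_size;
	uint64_t by_sectors = (UINT64_C(1) << 32) * 512;

	return by_mapping < by_sectors ? by_mapping : by_sectors;
}

/* Same limit in whole gigabytes (2^30 bytes), rounded down. */
uint64_t ext2_max_file_gb(uint64_t block_size)
{
	return ext2_max_file_bytes(block_size) >> 30;
}
```

With 1kB and 2kB blocks the block mapping is the limiting factor (16GB and
256GB); with 4kB and 8kB blocks the sector count is (2048GB).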
|
||||
|
||||
There is a 2.4 kernel limit of 2048GB for a single block device, so no
|
||||
filesystem larger than that can be created at this time. There is also
|
||||
an upper limit on the block size imposed by the page size of the kernel,
|
||||
so 8kB blocks are only allowed on Alpha systems (and other architectures
|
||||
which support larger pages).
|
||||
|
||||
There is an upper limit of 32768 subdirectories in a single directory.
|
||||
|
||||
There is a "soft" upper limit of about 10-15k files in a single directory
|
||||
with the current linear linked-list directory implementation. This limit
|
||||
stems from performance problems when creating and deleting (and also
|
||||
finding) files in such large directories. Using a hashed directory index
|
||||
(under development) allows 100k-1M+ files in a single directory without
|
||||
performance problems (although RAM size becomes an issue at this point).
|
||||
|
||||
The (meaningless) absolute upper limit of files in a single directory
|
||||
(imposed by the file size, the realistic limit is obviously much less)
|
||||
is over 130 trillion files. It would be higher except there are not
|
||||
enough 4-character names to make up unique directory entries, so they
|
||||
have to be 8 character filenames; even then we are fairly close to
|
||||
running out of unique filenames.
|
||||
|
||||
Journaling
|
||||
----------
|
||||
|
||||
A journaling extension to the ext2 code has been developed by Stephen
|
||||
Tweedie. It avoids the risks of metadata corruption and the need to
|
||||
wait for e2fsck to complete after a crash, without requiring a change
|
||||
to the on-disk ext2 layout. In a nutshell, the journal is a regular
|
||||
file which stores whole metadata (and optionally data) blocks that have
|
||||
been modified, prior to writing them into the filesystem. This means
|
||||
it is possible to add a journal to an existing ext2 filesystem without
|
||||
the need for data conversion.
|
||||
|
||||
When changes are made to the filesystem (e.g. a file is renamed) they are stored in
|
||||
a transaction in the journal and can either be complete or incomplete at
|
||||
the time of a crash. If a transaction is complete at the time of a crash
|
||||
(or in the normal case where the system does not crash), then any blocks
|
||||
in that transaction are guaranteed to represent a valid filesystem state,
|
||||
and are copied into the filesystem. If a transaction is incomplete at
|
||||
the time of the crash, then there is no guarantee of consistency for
|
||||
the blocks in that transaction so they are discarded (which means any
|
||||
filesystem changes they represent are also lost).
|
||||
Check Documentation/filesystems/ext3.txt if you want to read more about
|
||||
ext3 and journaling.
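
The complete/incomplete rule described above can be sketched as a toy replay
loop in C. Everything here is hypothetical (the structs, the fixed sizes, an
int standing in for a block's contents); it shows the idea only and is not the
ext3 or JBD recovery code. The sketch stops at the first incomplete
transaction, on the assumption that only the most recent one can be
incomplete.

```c
#include <stdbool.h>
#include <stddef.h>

#define FS_BLOCKS 8

/* One journaled copy of a filesystem block (hypothetical layout). */
struct logged_block {
	size_t blocknr;	/* where the block lives in the filesystem */
	int contents;	/* stand-in for the block's data */
};

struct transaction {
	bool complete;	/* was the transaction fully written before the crash? */
	size_t nblocks;
	struct logged_block blocks[4];
};

/*
 * Copy the blocks of complete transactions into fs[] in order and
 * discard the incomplete one. Returns the number of transactions replayed.
 */
size_t replay_journal(int fs[FS_BLOCKS], const struct transaction *log,
		      size_t ntrans)
{
	size_t replayed = 0;

	for (size_t t = 0; t < ntrans; t++) {
		if (!log[t].complete)
			break;	/* no consistency guarantee: discard */
		for (size_t b = 0; b < log[t].nblocks; b++)
			fs[log[t].blocks[b].blocknr] = log[t].blocks[b].contents;
		replayed++;
	}
	return replayed;
}
```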
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
The kernel source file:/usr/src/linux/fs/ext2/
|
||||
e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
|
||||
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
|
||||
Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
|
||||
Hashed Directories http://kernelnewbies.org/~phillips/htree/
|
||||
Filesystem Resizing http://ext2resize.sourceforge.net/
|
||||
Compression (*) http://www.netspace.net.au/~reiter/e2compr/
|
||||
|
||||
Implementations for:
|
||||
Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
|
||||
Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2
|
||||
DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
|
||||
OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
|
||||
RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/
|
||||
|
||||
(*) no longer actively developed/supported (as of Apr 2001)
|
183
Documentation/filesystems/ext3.txt
Normal file
@@ -0,0 +1,183 @@
|
||||
|
||||
Ext3 Filesystem
|
||||
===============
|
||||
|
||||
ext3 was originally released in September 1999. It was written by Stephen Tweedie
|
||||
for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
|
||||
Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
|
||||
|
||||
ext3 is the ext2 filesystem enhanced with journalling capabilities.
|
||||
|
||||
Options
|
||||
=======
|
||||
|
||||
When mounting an ext3 filesystem, the following options are accepted:
|
||||
(*) == default
|
||||
|
||||
journal=update Update the ext3 file system's journal to the
|
||||
current format.
|
||||
|
||||
journal=inum When a journal already exists, this option is
|
||||
ignored. Otherwise, it specifies the number of
|
||||
the inode which will represent the ext3 file
|
||||
system's journal file.
|
||||
|
||||
noload Don't load the journal on mounting.
|
||||
|
||||
data=journal All data are committed into the journal prior
|
||||
to being written into the main file system.
|
||||
|
||||
data=ordered (*) All data are forced directly out to the main file
|
||||
system prior to its metadata being committed to
|
||||
the journal.
|
||||
|
||||
data=writeback Data ordering is not preserved, data may be
|
||||
written into the main file system after its
|
||||
metadata has been committed to the journal.
|
||||
|
||||
commit=nrsec (*) Ext3 can be told to sync all its data and metadata
|
||||
every 'nrsec' seconds. The default value is 5 seconds.
|
||||
This means that if you lose your power, you will lose,
|
||||
at most, the latest 5 seconds of work (your filesystem
|
||||
will not be damaged though, thanks to journaling). This
|
||||
default value (or any low value) will hurt performance,
|
||||
but it's good for data-safety. Setting it to 0 will
|
||||
have the same effect as leaving the default 5 sec.
|
||||
Setting it to very large values will improve
|
||||
performance.
|
||||
|
||||
barrier=1 This enables/disables barriers. barrier=0 disables it,
|
||||
barrier=1 enables it.
|
||||
|
||||
orlov (*) This enables the new Orlov block allocator. It's enabled
|
||||
by default.
|
||||
|
||||
oldalloc This disables the Orlov block allocator and enables the
|
||||
old block allocator. Orlov should have better performance,
|
||||
we'd like to get some feedback if it's the contrary for
|
||||
you.
|
||||
|
||||
user_xattr (*) Enables POSIX Extended Attributes. It's enabled by
|
||||
default, however you need to configure its support
|
||||
(CONFIG_EXT3_FS_XATTR). This is necessary if you want
|
||||
to use POSIX Access Control Lists support. You can visit
|
||||
http://acl.bestbits.at to know more about POSIX Extended
|
||||
attributes.
|
||||
|
||||
nouser_xattr Disables POSIX Extended Attributes.
|
||||
|
||||
acl (*) Enables POSIX Access Control Lists support. This is
|
||||
enabled by default, however you need to configure
|
||||
its support (CONFIG_EXT3_FS_POSIX_ACL). If you want
|
||||
to know more about ACLs visit http://acl.bestbits.at
|
||||
|
||||
noacl This option disables POSIX Access Control List support.
|
||||
|
||||
reservation
|
||||
|
||||
noreservation
|
||||
|
||||
resize=
|
||||
|
||||
bsddf (*) Make 'df' act like BSD.
|
||||
minixdf Make 'df' act like Minix.
|
||||
|
||||
check=none Don't do extra checking of bitmaps on mount.
|
||||
nocheck
|
||||
|
||||
debug Extra debugging information is sent to syslog.
|
||||
|
||||
errors=remount-ro(*) Remount the filesystem read-only on an error.
|
||||
errors=continue Keep going on a filesystem error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
|
||||
grpid Give objects the same group ID as their parent directory.
|
||||
bsdgroups
|
||||
|
||||
nogrpid (*) New objects have the group ID of their creator.
|
||||
sysvgroups
|
||||
|
||||
resgid=n The group ID which may use the reserved blocks.
|
||||
|
||||
resuid=n The user ID which may use the reserved blocks.
|
||||
|
||||
sb=n Use alternate superblock at this location.
|
||||
|
||||
quota Quota options are currently silently ignored.
|
||||
noquota (see fs/ext3/super.c, line 594)
|
||||
grpquota
|
||||
usrquota
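
Several of the options above are normally combined into one comma-separated
string. As a small illustration, the C helper below builds such a string from
a data mode and a commit interval. The helper name and its parameters are
hypothetical; the option names data= and commit= are the ones listed above.

```c
#include <stdio.h>

/*
 * Build an ext3 option string such as "data=ordered,commit=5".
 * data_mode should be "journal", "ordered" or "writeback".
 * Returns 0 on success, -1 if the buffer is too small.
 */
int ext3_build_opts(char *buf, size_t size, const char *data_mode,
		    unsigned int commit_secs)
{
	int n = snprintf(buf, size, "data=%s,commit=%u", data_mode, commit_secs);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The resulting string is what would follow -o on the mount command line, or be
passed as the data argument of the mount(2) system call.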
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
ext3 shares its on-disk implementation with the ext2 filesystem, and adds
|
||||
transaction capabilities to ext2. Journaling is done by the
|
||||
Journaling Block Device layer.
|
||||
|
||||
Journaling Block Device layer
|
||||
-----------------------------
|
||||
The Journaling Block Device layer (JBD) isn't ext3 specific. It was
|
||||
designed to add journaling capabilities to a block device. The ext3
|
||||
filesystem code will inform the JBD of modifications it is performing
|
||||
(called a transaction). The journal supports transaction start and
|
||||
stop, and in case of a crash, the journal can replay the transactions
|
||||
to put the partition back into a consistent state quickly.
|
||||
|
||||
Handles represent a single atomic update to a filesystem. JBD can
|
||||
handle an external journal on a block device.
|
||||
|
||||
Data Mode
|
||||
---------
|
||||
There are 3 different data modes:
|
||||
|
||||
* writeback mode
|
||||
In data=writeback mode, ext3 does not journal data at all. This mode
|
||||
provides a similar level of journaling as XFS, JFS, and ReiserFS in its
|
||||
default mode - metadata journaling. A crash+recovery can cause
|
||||
incorrect data to appear in files which were written shortly before the
|
||||
crash. This mode will typically provide the best ext3 performance.
|
||||
|
||||
* ordered mode
|
||||
In data=ordered mode, ext3 only officially journals metadata, but it
|
||||
logically groups metadata and data blocks into a single unit called a
|
||||
transaction. When it's time to write the new metadata out to disk, the
|
||||
associated data blocks are written first. In general, this mode
|
||||
performs slightly slower than writeback but significantly faster than
|
||||
journal mode.
|
||||
|
||||
* journal mode
|
||||
data=journal mode provides full data and metadata journaling. All new
|
||||
data is written to the journal first, and then to its final location.
|
||||
In the event of a crash, the journal can be replayed, bringing both
|
||||
data and metadata into a consistent state. This mode is the slowest
|
||||
except when data needs to be read from and written to disk at the same
|
||||
time, where it outperforms all other modes.
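
The difference between the three modes comes down to whether data passes
through the journal and in what order things are written. The C fragment
below merely restates the descriptions above in code form; the enum and the
function names are hypothetical and nothing here is taken from the ext3
source.

```c
#include <stdbool.h>

enum ext3_data_mode { DATA_WRITEBACK, DATA_ORDERED, DATA_JOURNAL };

/* Does file data itself pass through the journal? */
bool journals_data(enum ext3_data_mode mode)
{
	return mode == DATA_JOURNAL;
}

/* Short description of the write ordering in each mode. */
const char *data_write_order(enum ext3_data_mode mode)
{
	switch (mode) {
	case DATA_WRITEBACK:
		return "metadata journaled; data written in no particular order";
	case DATA_ORDERED:
		return "data written to final location, then metadata journaled";
	case DATA_JOURNAL:
		return "data and metadata journaled, then written to final location";
	}
	return "unknown mode";
}
```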
|
||||
|
||||
Compatibility
|
||||
-------------
|
||||
|
||||
Ext2 partitions can be easily converted to ext3, with `tune2fs -j <dev>`.
|
||||
Ext3 is fully compatible with Ext2. Ext3 partitions can easily be
|
||||
mounted as Ext2.
|
||||
|
||||
External Tools
|
||||
==============
|
||||
See the manual pages to learn more.
|
||||
|
||||
tune2fs: create an ext3 journal on an ext2 partition with the -j flag
|
||||
mke2fs: create an ext3 partition with the -j flag
|
||||
debugfs: ext2 and ext3 file system debugger
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
kernel source: file:/usr/src/linux/fs/ext3
|
||||
file:/usr/src/linux/fs/jbd
|
||||
|
||||
programs: http://e2fsprogs.sourceforge.net
|
||||
|
||||
useful links:
|
||||
http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
|
||||
http://www-106.ibm.com/developerworks/linux/library/l-fs7/
|
||||
http://www-106.ibm.com/developerworks/linux/library/l-fs8/
|
83
Documentation/filesystems/hfs.txt
Normal file
@@ -0,0 +1,83 @@
|
||||
|
||||
Macintosh HFS Filesystem for Linux
|
||||
==================================
|
||||
|
||||
HFS stands for ``Hierarchical File System'' and is the filesystem used
|
||||
by the Mac Plus and all later Macintosh models. Earlier Macintosh
|
||||
models used MFS (``Macintosh File System''), which is not supported.
|
||||
MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
|
||||
HFS but is extended in various areas. Use the hfsplus filesystem driver
|
||||
to access such filesystems from Linux.
|
||||
|
||||
|
||||
Mount options
|
||||
=============
|
||||
|
||||
When mounting an HFS filesystem, the following options are accepted:
|
||||
|
||||
creator=cccc, type=cccc
|
||||
Specifies the creator/type values as shown by the MacOS Finder
|
||||
used for creating new files. Default values: '????'.
|
||||
|
||||
uid=n, gid=n
|
||||
Specifies the user/group that owns all files on the filesystem.
|
||||
Default: user/group id of the mounting process.
|
||||
|
||||
dir_umask=n, file_umask=n, umask=n
|
||||
Specifies the umask used for all files, all directories or all
|
||||
files and directories. Defaults to the umask of the mounting process.
|
||||
|
||||
session=n
|
||||
Select the CDROM session to mount as an HFS filesystem. Defaults to
|
||||
leaving that decision to the CDROM driver. This option will fail
|
||||
with anything but a CDROM as the underlying device.
|
||||
|
||||
part=n
|
||||
Select partition number n from the device. This only makes
|
||||
sense for CDROMS because they can't be partitioned under Linux.
|
||||
For disk devices the generic partition parsing code does this
|
||||
for us. Defaults to not parsing the partition table at all.
|
||||
|
||||
quiet
|
||||
Ignore invalid mount options instead of complaining.
|
||||
|
||||
|
||||
Writing to HFS Filesystems
|
||||
==========================
|
||||
|
||||
HFS is not a UNIX filesystem, thus it does not have the usual features you'd
|
||||
expect:
|
||||
|
||||
o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
|
||||
and gid of files.
|
||||
o You can't create hard- or symlinks, device files, sockets or FIFOs.
|
||||
|
||||
HFS does, on the other hand, have the concept of multiple forks per file. These
|
||||
non-standard forks are represented as hidden additional files in the normal
|
||||
filesystem namespace, which is kind of a kludge and makes the semantics for
|
||||
them a little strange:
|
||||
|
||||
o You can't create, delete or rename resource forks of files or the
|
||||
Finder's metadata.
|
||||
o They are however created (with default values), deleted and renamed
|
||||
along with the corresponding data fork or directory.
|
||||
o Copying files to a different filesystem will lose those attributes
|
||||
that are essential for MacOS to work.
|
||||
|
||||
|
||||
Creating HFS filesystems
|
||||
========================
|
||||
|
||||
The hfsutils package from Robert Leslie contains a program called
|
||||
hformat that can be used to create an HFS filesystem. See
|
||||
<http://www.mars.org/home/rob/proj/hfs/> for details.
|
||||
|
||||
|
||||
Credits
|
||||
=======
|
||||
|
||||
The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU)
|
||||
and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis
|
||||
Technologies.
|
||||
Roman rewrote large parts of the code and brought in btree routines derived
|
||||
from Brad Boyer's hfsplus driver (also maintained by Roman now).
|
296
Documentation/filesystems/hpfs.txt
Normal file
@@ -0,0 +1,296 @@
|
||||
Read/Write HPFS 2.09
|
||||
1998-2004, Mikulas Patocka
|
||||
|
||||
email: mikulas@artax.karlin.mff.cuni.cz
|
||||
homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
|
||||
|
||||
CREDITS:
|
||||
Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
|
||||
is taken from it
|
||||
Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
|
||||
Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
|
||||
|
||||
Mount options
|
||||
|
||||
uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
|
||||
Set owner/group/mode for files that do not have it specified in extended
|
||||
attributes. Mode is inverted umask - for example umask 027 gives owner
|
||||
all permission, group read permission and anybody else no access. Note
|
||||
that for files mode is anded with 0666. If you want files to have 'x'
|
||||
rights, you must use extended attributes.
|
||||
case=lower,asis (default asis)
|
||||
File name lowercasing in readdir.
|
||||
conv=binary,text,auto (default binary)
|
||||
CR/LF -> LF conversion, if auto, decision is made according to extension
|
||||
- there is a list of text extensions (I think it's better to not convert
|
||||
text file than to damage binary file). If you want to change that list,
|
||||
change it in the source. Original readonly HPFS contained some strange
|
||||
heuristic algorithm that I removed. I think it's dangerous to let the
|
||||
computer decide whether file is text or binary. For example, DJGPP
|
||||
binaries contain small text message at the beginning and they could be
|
||||
misidentified and damaged under some circumstances.
|
||||
check=none,normal,strict (default normal)
|
||||
Check level. Selecting none will cause only little speedup and big
|
||||
danger. I tried to write it so that it won't crash if check=normal on
|
||||
corrupted filesystems. check=strict means many superfluous checks -
|
||||
used for debugging (for example it checks if file is allocated in
|
||||
bitmaps when accessing it).
|
||||
errors=continue,remount-ro,panic (default remount-ro)
|
||||
Behaviour when filesystem errors are found.
|
||||
chkdsk=no,errors,always (default errors)
|
||||
When to mark filesystem dirty so that OS/2 checks it.
|
||||
eas=no,ro,rw (default rw)
|
||||
What to do with extended attributes. 'no' - ignore them and always use the
|
||||
values specified in uid/gid/mode options. 'ro' - read extended
|
||||
attributes but do not create them. 'rw' - create extended attributes
|
||||
when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
|
||||
timeshift=(-)nnn (default 0)
|
||||
Shifts the time by nnn seconds. For example, if you see under linux
|
||||
one hour more than under OS/2, use timeshift=-3600.
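
As a worked example of the umask rule given under uid/gid/umask above, the
following C sketch computes the resulting permission bits. The function names
are hypothetical and this is not the driver's code; it only applies the two
stated rules (mode is the inverted umask, and for files it is ANDed with
0666).

```c
/* Permission bits for a directory: the inverted umask. */
unsigned int hpfs_dir_mode(unsigned int umask_bits)
{
	return ~umask_bits & 0777;
}

/* Permission bits for a regular file: inverted umask ANDed with 0666. */
unsigned int hpfs_file_mode(unsigned int umask_bits)
{
	return ~umask_bits & 0777 & 0666;
}
```

With umask=027 this gives 0750 for directories and 0640 for files, so files
never get 'x' rights unless extended attributes are used.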
|
||||
|
||||
|
||||
File names
|
||||
|
||||
As in OS/2, filenames are case insensitive. However, shell thinks that names
|
||||
are case sensitive, so for example when you create a file FOO, you can use
|
||||
'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you
|
||||
also won't be able to compile linux kernel (and maybe other things) on HPFS
|
||||
because kernel creates different files with names like bootsect.S and
|
||||
bootsect.s. When searching for a file whose name has characters >= 128, codepages
|
||||
are used - see below.
|
||||
OS/2 ignores dots and spaces at the end of file name, so this driver does as
|
||||
well. If you create 'a. ...', the file 'a' will be created, but you can still
|
||||
access it under names 'a.', 'a..', 'a . . . ' etc.
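
The trailing dot/space rule can be illustrated with a small C helper that
returns the length of a name once trailing dots and spaces are ignored. The
function name is hypothetical and this is not the driver's own code; special
handling of the "." and ".." entries is deliberately left out of the sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Length of name once trailing dots and spaces are ignored, so that
 * "a. ..." and "a . . . " both have an effective length of 1.
 */
size_t hpfs_effective_len(const char *name)
{
	size_t len = strlen(name);

	while (len > 0 && (name[len - 1] == '.' || name[len - 1] == ' '))
		len--;
	return len;
}
```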
|
||||
|
||||
|
||||
Extended attributes
|
||||
|
||||
On HPFS partitions, OS/2 can associate with each file special information called
|
||||
extended attributes. Extended attributes are pairs of (key,value) where key is
|
||||
an ascii string identifying that attribute and value is any string of bytes of
|
||||
variable length. OS/2 stores window and icon positions and file types there. So
|
||||
why not use it for unix-specific info like file owner or access rights? This
|
||||
driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended
|
||||
attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
|
||||
those extended attributes whose values differ from the defaults specified in mount
|
||||
options are created. Once created, the extended attributes are never deleted,
|
||||
they're just changed. It means that when your default uid=0 and you type
|
||||
something like 'chown luser file; chown root file' the file will contain
|
||||
extended attribute UID=0. And when you umount the fs and mount it again with
|
||||
uid=luser_uid, the file will be still owned by root! If you chmod file to 444,
|
||||
extended attribute "MODE" will not be set, this special case is done by setting
|
||||
read-only flag. When you mknod a block or char device, besides "MODE", the
|
||||
special 4-byte extended attribute "DEV" will be created containing the device
|
||||
number. Currently this driver cannot resize extended attributes - it means
|
||||
that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
|
||||
attributes with different sizes, they won't be rewritten and changing these
|
||||
values doesn't work.
|
||||
|
||||
|
||||
Symlinks
|
||||
|
||||
You can create symlinks on an HPFS partition; they are implemented by setting an extended
|
||||
attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
|
||||
chgrp symlinks but I don't know what it is good for. chmoding a symlink results
|
||||
in chmoding the file the symlink points to. These symlinks are just for Linux use and
|
||||
incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are
|
||||
stored in a very crazy way. They tried to do it so that link changes when file is
|
||||
moved ... sometimes it works. But the link is partly stored in directory
|
||||
extended attributes and partly in OS2SYS.INI. I don't want (and don't know how)
|
||||
to analyze or change OS2SYS.INI.
|
||||
|
||||
|
||||
Codepages
|
||||
|
||||
HPFS can contain several uppercasing tables for several codepages and each
|
||||
file has a pointer to the codepage its name is in. However OS/2 was created in
|
||||
America where people don't care much about codepages and so multiple codepages
|
||||
support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
|
||||
Once I booted English OS/2 working in cp 850 and I created a file on my 852
|
||||
partition. It marked file name codepage as 850 - good. But when I again booted
|
||||
Czech OS/2, the file was completely inaccessible under any name. It seems that
|
||||
OS/2 uppercases the search pattern with its system code page (852) and file
|
||||
name it's comparing to with its code page (850). These could never match. Is it
|
||||
really what IBM developers wanted? But problems continued. When I created in
|
||||
Czech OS/2 another file in that directory, that file was inaccessible too. OS/2
|
||||
probably uses different uppercasing method when searching where to place a file
|
||||
(note, that files in HPFS directory must be sorted) and when searching for
|
||||
a file. Finally when I opened this directory in PmShell, PmShell crashed (the
|
||||
funny thing was that, when rebooted, PmShell tried to reopen this directory
|
||||
again :-). chkdsk happily ignores these errors and only low-level disk
|
||||
modification saved me. Never mix different language versions of OS/2 on one
|
||||
system although HPFS was designed to allow that.
|
||||
OK, I could implement complex codepage support to this driver but I think it
|
||||
would cause more problems than benefit with such buggy implementation in OS/2.
|
||||
So this driver simply uses first codepage it finds for uppercasing and
|
||||
lowercasing no matter what the file's codepage index is. Usually all file names are in
|
||||
this codepage - if you don't try to do what I described above :-)
|
||||
|
||||
|
||||
Known bugs
|
||||
|
||||
HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
|
||||
should work. If you have OS/2 server, use only read-only mode. I don't know how
|
||||
to handle some HPFS386 structures like access control list or extended perm
|
||||
list, I don't know how to delete them when file is deleted and how to not
|
||||
overwrite them with extended attributes. Send me some info on these structures
|
||||
and I'll make it. However, this driver should detect presence of HPFS386
|
||||
structures, remount read-only and not destroy them (I hope).
|
||||
|
||||
When there's not enough space for extended attributes, they will be truncated
|
||||
and no error is returned.
|
||||
|
||||
OS/2 can't access files if the path is longer than about 256 chars but this
|
||||
driver allows you to do it. chkdsk ignores such errors.
|
||||
|
||||
Sometimes you won't be able to delete some files on a very full filesystem
|
||||
(returning error ENOSPC). That's because a file in a non-leaf node in the directory tree
|
||||
(one directory, if it's large, has dirents in tree on HPFS) must be replaced
|
||||
with another node when deleted. And that new file might have a longer name than
|
||||
the old one so the new name doesn't fit in directory node (dnode). And that
|
||||
would result in directory tree splitting, that takes disk space. Workaround is
|
||||
to delete other files that are leaf (probability that the file is non-leaf is
|
||||
about 1/50) or to truncate file first to make some space.
|
||||
You encounter this problem only if you have many directories so that
|
||||
preallocated directory band is full i.e.
|
||||
number_of_directories / size_of_filesystem_in_mb > 4.
|
||||
|
||||
You can't delete open directories.
|
||||
|
||||
You can't rename over directories (what is it good for?).
|
||||
|
||||
Renaming files so that only case changes doesn't work. This driver supports it
|
||||
but vfs doesn't. Something like 'mv file FILE' won't work.
|
||||
|
||||
All atimes and directory mtimes are not updated. That's because of performance
|
||||
reasons. If you extremely wish to update them, let me know, I'll write it (but
|
||||
it will be slow).
|
||||
|
||||
When the system is out of memory and swap, it may slightly corrupt filesystem
|
||||
(lost files, unbalanced directories). (I guess all filesystems may do it).
|
||||
|
||||
When compiled, you get warning: function declaration isn't a prototype. Does
|
||||
anybody know what it means?
|
||||
|
||||
|
||||
What does "unbalanced tree" message mean?
|
||||
|
||||
Old versions of this driver sometimes created unbalanced dnode trees. OS/2
|
||||
chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
|
||||
unbalanced trees too :-) but both HPFS and HPFS386 contain a bug that rarely
|
||||
crashes when the tree is not balanced. This driver handles unbalanced trees
|
||||
correctly and writes a warning if it finds them. If you see this message, this is
|
||||
probably because of directories created with an old version of this driver.
|
||||
Workaround is to move all files from that directory to another and then back
|
||||
again. Do it in Linux, not OS/2! If you see this message in a directory that was
|
||||
wholly created by this driver, it is a BUG - let me know about it.
|
||||
|
||||
|
||||
Bugs in OS/2
|
||||
|
||||
When you have two (or more) lost directories pointing to each other, chkdsk
|
||||
locks up when repairing filesystem.
|
||||
|
||||
Sometimes (I think it's random) when you create a file with one-char name under
|
||||
OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
|
||||
error corrected".
|
||||
|
||||
File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
|
||||
marks them as short (and writes "minor fs error corrected"). This bug is not in
|
||||
HPFS386.
|
||||
|
||||
Codepage bugs described above.
|
||||
|
||||
If you don't install fixpacks, there are many, many more...
|
||||
|
||||
|
||||
History
|
||||
|
||||
0.90 First public release
|
||||
0.91 Fixed bug that caused shooting to memory when write_inode was called on
|
||||
open inode (rarely happened)
|
||||
0.92 Fixed a little memory leak in freeing directory inodes
|
||||
0.93 Fixed bug that locked up the machine when there were too many filenames
|
||||
with first 15 characters same
|
||||
Fixed write_file to zero file when writing behind file end
|
||||
0.94 Fixed a little memory leak when trying to delete busy file or directory
|
||||
0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
|
||||
1.90 First version for 2.1.1xx kernels
|
||||
1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
|
||||
Fixed a race-condition when write_inode is called while deleting file
|
||||
Fixed a bug that could possibly happen (with very low probability) when
|
||||
using 0xff in filenames
|
||||
Rewritten locking to avoid race-conditions
|
||||
Mount option 'eas' now works
|
||||
Fsync no longer returns error
|
||||
Files beginning with '.' are marked hidden
|
||||
Remount support added
|
||||
Alloc is not so slow when filesystem becomes full
|
||||
Atimes are no more updated because it slows down operation
|
||||
Code cleanup (removed all commented debug prints)
|
||||
1.92 Corrected a bug when sync was called just before closing file
|
||||
1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
|
||||
works with previous versions
|
||||
Fixed a possible problem with disks > 64G (but I don't have one, so I can't
|
||||
test it)
|
||||
Fixed a file overflow at 2G
|
||||
Added new option 'timeshift'
|
||||
Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
|
||||
read-only mode
|
||||
Fixed a bug that slowed down alloc and prevented allocating 100% space
|
||||
(this bug was not destructive)
|
||||
1.94 Added workaround for one bug in Linux
|
||||
Fixed one buffer leak
|
||||
Fixed some incompatibilities with large extended attributes (but it's still
|
||||
not 100% ok, I have no info on it and OS/2 doesn't want to create them)
|
||||
Rewritten allocation
|
||||
Fixed a bug with i_blocks (du sometimes didn't display correct values)
|
||||
Directories have no longer archive attribute set (some programs don't like
|
||||
it)
|
||||
Fixed a bug that it set badly one flag in large anode tree (it was not
|
||||
destructive)
|
||||
1.95 Fixed one buffer leak, that could happen on corrupted filesystem
|
||||
Fixed one bug in allocation in 1.94
|
||||
1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
|
||||
error sometimes when opening directories in PMSHELL)
|
||||
Fixed a possible bitmap race
|
||||
Fixed possible problem on large disks
|
||||
You can now delete open files
|
||||
Fixed a nondestructive race in rename
|
||||
1.97 Support for HPFS v3 (on large partitions)
|
||||
Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
|
||||
1.97.1 Changed names of global symbols
|
||||
Fixed a bug when chmoding or chowning root directory
|
||||
1.98 Fixed a deadlock when using old_readdir
|
||||
Better directory handling; workaround for "unbalanced tree" bug in OS/2
|
||||
1.99 Corrected a possible problem when there's not enough space while deleting
|
||||
file
|
||||
Now it tries to truncate the file if there's not enough space when deleting
|
||||
Removed a lot of redundant code
|
||||
2.00 Fixed a bug in rename (it was there since 1.96)
|
||||
Better anti-fragmentation strategy
|
||||
2.01 Fixed problem with directory listing over NFS
|
||||
Directory lseek now checks for proper parameters
|
||||
Fixed race-condition in buffer code - it is in all filesystems in Linux;
|
||||
when reading device (cat /dev/hda) while creating files on it, files
|
||||
could be damaged
|
||||
2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
|
||||
end of partition
|
||||
2.03 Char, block devices and pipes are correctly created
|
||||
Fixed non-crashing race in unlink (Alexander Viro)
|
||||
Now it works with Japanese version of OS/2
|
||||
2.04 Fixed error when ftruncate used to extend file
|
||||
2.05 Fixed crash when got mount parameters without =
|
||||
Fixed crash when allocation of anode failed due to full disk
|
||||
Fixed some crashes when block io or inode allocation failed
|
||||
2.06 Fixed some crash on corrupted disk structures
|
||||
Better allocation strategy
|
||||
Reschedule points added so that it doesn't lock CPU long time
|
||||
It should work in read-only mode on Warp Server
|
||||
2.07 More fixes for Warp Server. Now it really works
|
||||
2.08 Creating new files is not so slow on large disks
|
||||
An attempt to sync deleted file does not generate filesystem error
|
||||
2.09 Fixed error on extremely fragmented files
|
||||
|
||||
|
||||
vim: set textwidth=80:
|
38
Documentation/filesystems/isofs.txt
Normal file
@@ -0,0 +1,38 @@
|
||||
Mount options that are the same as for msdos and vfat partitions.
|
||||
|
||||
gid=nnn All files in the partition will be in group nnn.
|
||||
uid=nnn All files in the partition will be owned by user id nnn.
|
||||
umask=nnn The permission mask (see umask(1)) for the partition.
|
||||
|
||||
Mount options that are the same as for vfat partitions. These are only useful
|
||||
when using discs encoded using Microsoft's Joliet extensions.
|
||||
iocharset=name Character set to use for converting from Unicode to
|
||||
ASCII. Joliet filenames are stored in Unicode format, but
|
||||
Unix for the most part doesn't know how to deal with Unicode.
|
||||
There is also an option of doing UTF8 translations with the
|
||||
utf8 option.
|
||||
utf8 Encode Unicode names in UTF8 format. Default is no.
|
||||
|
||||
Mount options unique to the isofs filesystem.
|
||||
block=512 Set the block size for the disk to 512 bytes
|
||||
block=1024 Set the block size for the disk to 1024 bytes
|
||||
block=2048 Set the block size for the disk to 2048 bytes
|
||||
check=relaxed Matches filenames with different cases
|
||||
check=strict Matches only filenames with the exact same case
|
||||
cruft Try to handle badly formatted CDs.
|
||||
map=off Do not map non-Rock Ridge filenames to lower case
|
||||
map=normal Map non-Rock Ridge filenames to lower case
|
||||
map=acorn As map=normal but also apply Acorn extensions if present
|
||||
mode=xxx Sets the permissions on files to xxx
|
||||
nojoliet Ignore Joliet extensions if they are present.
|
||||
norock Ignore Rock Ridge extensions if they are present.
|
||||
unhide Show hidden files.
|
||||
session=x Select number of session on multisession CD
|
||||
sbsector=xxx Session begins from sector xxx
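
As an illustration of how the options above combine, the following minimal
sketch joins a list of options into the comma-separated string that would be
passed to "mount -t iso9660 -o ...". The helper name isofs_option_string and
the ENUMERATED table are hypothetical and not part of any tool; only the
values listed above are checked, every other option is passed through as is.

```python
# Values that the option list above enumerates explicitly. Every other
# option (uid=, gid=, mode=, session=, ...) takes a free-form value.
ENUMERATED = {
    "block": {"512", "1024", "2048"},
    "check": {"relaxed", "strict"},
    "map": {"off", "normal", "acorn"},
}


def isofs_option_string(options):
    """Join mount options with commas after checking the enumerated ones."""
    for option in options:
        key, has_value, value = option.partition("=")
        if has_value and key in ENUMERATED and value not in ENUMERATED[key]:
            raise ValueError(f"unsupported value for {key}: {value!r}")
    return ",".join(options)


# "ro" is a generic mount option, the others come from the list above.
print(isofs_option_string(["ro", "norock", "map=off", "block=2048"]))
```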
|
||||
|
||||
Recommended documents about ISO 9660 standard are located at:
|
||||
http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm
|
||||
ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
|
||||
Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
|
||||
identical with ISO 9660.", so it is a valid and gratis substitute for the
|
||||
official ISO specification.
|
35
Documentation/filesystems/jfs.txt
Normal file
@@ -0,0 +1,35 @@
|
||||
IBM's Journaled File System (JFS) for Linux
|
||||
|
||||
JFS Homepage: http://jfs.sourceforge.net/
|
||||
|
||||
The following mount options are supported:
|
||||
|
||||
iocharset=name Character set to use for converting from Unicode to
|
||||
ASCII. The default is to do no conversion. Use
|
||||
iocharset=utf8 for UTF8 translations. This requires
|
||||
CONFIG_NLS_UTF8 to be set in the kernel .config file.
|
||||
iocharset=none specifies the default behavior explicitly.
|
||||
|
||||
resize=value Resize the volume to <value> blocks. JFS only supports
|
||||
growing a volume, not shrinking it. This option is only
|
||||
valid during a remount, when the volume is mounted
|
||||
read-write. The resize keyword with no value will grow
|
||||
the volume to the full size of the partition.
|
||||
|
||||
nointegrity Do not write to the journal. The primary use of this option
|
||||
is to allow for higher performance when restoring a volume
|
||||
from backup media. The integrity of the volume is not
|
||||
guaranteed if the system abnormally abends.
|
||||
|
||||
integrity Default. Commit metadata changes to the journal. Use this
|
||||
option to remount a volume where the nointegrity option was
|
||||
previously specified in order to restore normal behavior.
|
||||
|
||||
errors=continue Keep going on a filesystem error.
|
||||
errors=remount-ro Default. Remount the filesystem read-only on an error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
|
||||
Please send bugs, comments, cards and letters to shaggy@austin.ibm.com.
|
||||
|
||||
The JFS mailing list can be subscribed to by using the link labeled
|
||||
"Mail list Subscribe" at our web page http://jfs.sourceforge.net/
|
12
Documentation/filesystems/ncpfs.txt
Normal file
@@ -0,0 +1,12 @@
|
||||
The ncpfs filesystem understands the NCP protocol, designed by the
|
||||
Novell Corporation for their NetWare(tm) product. NCP is functionally
|
||||
similar to the NFS used in the TCP/IP community.
|
||||
To mount a NetWare filesystem, you need a special mount program, which
|
||||
can be found in the ncpfs package. The home site for ncpfs is
|
||||
ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
|
||||
will have it as well.
|
||||
|
||||
Related products are linware and mars_nwe, which will give Linux partial
|
||||
NetWare server functionality. Linware's home site is
|
||||
klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on
|
||||
ftp.gwdg.de/pub/linux/misc/ncpfs.
|
630
Documentation/filesystems/ntfs.txt
Normal file
@@ -0,0 +1,630 @@
|
||||
The Linux NTFS filesystem driver
|
||||
================================
|
||||
|
||||
|
||||
Table of contents
|
||||
=================
|
||||
|
||||
- Overview
|
||||
- Web site
|
||||
- Features
|
||||
- Supported mount options
|
||||
- Known bugs and (mis-)features
|
||||
- Using NTFS volume and stripe sets
|
||||
- The Device-Mapper driver
|
||||
- The Software RAID / MD driver
|
||||
- Limitations when using the MD driver
|
||||
- ChangeLog
|
||||
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
|
||||
These include mkntfs, a full-featured ntfs file system format utility,
|
||||
ntfsundelete used for recovering files that were unintentionally deleted
|
||||
from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
|
||||
See the web site for more information.
|
||||
|
||||
To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
|
||||
system type 'ntfs'. The driver currently supports read-only mode (with no
|
||||
fault-tolerance, encryption or journalling) and very limited, but safe, write
|
||||
support.
|
||||
|
||||
For fault tolerance and raid support (i.e. volume and stripe sets), you can
|
||||
use the kernel's Software RAID / MD driver. See section "Using NTFS volume
|
||||
and stripe sets" for details.
|
||||
|
||||
|
||||
Web site
|
||||
========
|
||||
|
||||
There is plenty of additional information on the linux-ntfs web site
|
||||
at http://linux-ntfs.sourceforge.net/
|
||||
|
||||
The web site has a lot of additional information, such as a comprehensive
|
||||
FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
|
||||
userspace utilities, etc.
|
||||
|
||||
|
||||
Features
|
||||
========
|
||||
|
||||
- This is a complete rewrite of the NTFS driver that used to be in the kernel.
|
||||
This new driver implements NTFS read support and is functionally equivalent
|
||||
to the old ntfs driver.
|
||||
- The new driver has full support for sparse files on NTFS 3.x volumes which
|
||||
the old driver isn't happy with.
|
||||
- The new driver supports execution of binaries due to mmap() now being
|
||||
supported.
|
||||
- The new driver supports loopback mounting of files on NTFS which is used by
|
||||
some Linux distributions to enable the user to run Linux from an NTFS
|
||||
partition by creating a large file while in Windows and then loopback
|
||||
mounting the file while in Linux and creating a Linux filesystem on it that
|
||||
is used to install Linux on it.
|
||||
- A comparison of the two drivers using:
|
||||
time find . -type f -exec md5sum "{}" \;
|
||||
run three times in sequence with each driver (after a reboot) on a 1.4GiB
|
||||
NTFS partition, showed the new driver to be 20% faster in total time elapsed
|
||||
(from 9:43 minutes on average down to 7:53). The time spent in user space
|
||||
was unchanged but the time spent in the kernel was decreased by a factor of
|
||||
2.5 (from 85 CPU seconds down to 33).
|
||||
- The driver does not support short file names in general. For backwards
|
||||
compatibility, we implement access to files using their short file names if
|
||||
they exist. The driver will not create short file names however, and a
|
||||
rename will discard any existing short file name.
|
||||
- The new driver supports exporting of mounted NTFS volumes via NFS.
|
||||
- The new driver supports async io (aio).
|
||||
- The new driver supports fsync(2), fdatasync(2), and msync(2).
|
||||
- The new driver supports readv(2) and writev(2).
|
||||
- The new driver supports access time updates (including mtime and ctime).
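
The figures quoted in the driver comparison above can be rechecked with a few
lines of arithmetic. This is only a sketch using the numbers given in the
list; the variable names are made up for the illustration. It shows that the
elapsed time drops by roughly 19%, which the text rounds to 20%, and that the
kernel time ratio is a little above 2.5.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes:seconds reading into seconds."""
    return minutes * 60 + seconds


old_elapsed = to_seconds(9, 43)  # old driver, average elapsed time
new_elapsed = to_seconds(7, 53)  # new driver, average elapsed time

# Fraction of the old elapsed time that the new driver saves.
time_saved = (old_elapsed - new_elapsed) / old_elapsed

# Ratio of kernel CPU seconds, old driver versus new driver.
kernel_factor = 85 / 33

print(f"elapsed: {old_elapsed}s -> {new_elapsed}s ({time_saved:.1%} less)")
print(f"kernel CPU time reduced by a factor of {kernel_factor:.2f}")
```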
|
||||
|
||||
|
||||
Supported mount options
|
||||
=======================
|
||||
|
||||
In addition to the generic mount options described by the manual page for the
|
||||
mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
|
||||
following mount options:
|
||||
|
||||
iocharset=name Deprecated option. Still supported but please use
|
||||
nls=name in the future. See description for nls=name.
|
||||
|
||||
nls=name Character set to use when returning file names.
|
||||
Unlike VFAT, NTFS suppresses names that contain
|
||||
unconvertible characters. Note that most character
|
||||
sets contain insufficient characters to represent all
|
||||
possible Unicode characters that can exist on NTFS.
|
||||
To be sure you are not missing any files, you are
|
||||
advised to use nls=utf8 which is capable of
|
||||
representing all Unicode characters.
|
||||
|
||||
utf8=<bool> Option no longer supported. Currently mapped to
|
||||
nls=utf8 but please use nls=utf8 in the future and
|
||||
make sure utf8 is compiled either as module or into
|
||||
the kernel. See description for nls=name.
|
||||
|
||||
uid=
|
||||
gid=
|
||||
umask= Provide default owner, group, and access mode mask.
|
||||
These options work as documented in mount(8). By
|
||||
default, the files/directories are owned by root and
|
||||
he/she has read and write permissions, as well as
|
||||
browse permission for directories. No one else has any
|
||||
access permissions. I.e. the mode on all files is by
|
||||
default rw------- and for directories rwx------, a
|
||||
consequence of the default fmask=0177 and dmask=0077.
|
||||
Using a umask of zero will grant all permissions to
|
||||
everyone, i.e. all files and directories will have mode
|
||||
rwxrwxrwx.
|
||||
|
||||
fmask=
|
||||
dmask= Instead of specifying umask which applies both to
|
||||
files and directories, fmask applies only to files and
|
||||
dmask only to directories.
|
||||
|
||||
sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
|
||||
Otherwise the default behaviour is to abort mount if
|
||||
any unknown options are found.
|
||||
|
||||
show_sys_files=<BOOL> If show_sys_files is specified, show the system files
|
||||
in directory listings. Otherwise the default behaviour
|
||||
is to hide the system files.
|
||||
Note that even when show_sys_files is specified, "$MFT"
|
||||
will not be visible due to bugs/mis-features in glibc.
|
||||
Further, note that irrespective of show_sys_files, all
|
||||
files are accessible by name, i.e. you can always do
|
||||
"ls -l \$UpCase" for example to specifically show the
|
||||
system file containing the Unicode upcase table.
|
||||
|
||||
case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
|
||||
case sensitive and create file names in the POSIX
|
||||
namespace. Otherwise the default behaviour is to treat
|
||||
file names as case insensitive and to create file names
|
||||
in the WIN32/LONG name space. Note, the Linux NTFS
|
||||
driver will never create short file names and will
|
||||
remove them on rename/delete of the corresponding long
|
||||
file name.
|
||||
Note that files remain accessible via their short file
|
||||
name, if it exists. If case_sensitive, you will need
|
||||
to provide the correct case of the short file name.
|
||||
|
||||
errors=opt What to do when critical file system errors are found.
|
||||
Following values can be used for "opt":
|
||||
continue: DEFAULT, try to clean-up as much as
|
||||
possible, e.g. marking a corrupt inode as
|
||||
bad so it is no longer accessed, and then
|
||||
continue.
|
||||
recover: At present only supported is recovery of
|
||||
the boot sector from the backup copy.
|
||||
If read-only mount, the recovery is done
|
||||
in memory only and not written to disk.
|
||||
Note that the options are additive, i.e. specifying:
|
||||
errors=continue,errors=recover
|
||||
means the driver will attempt to recover and if that
|
||||
fails it will clean-up as much as possible and
|
||||
continue.
|
||||
|
||||
mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
|
||||
setting is not persistent across mounts and can be
|
||||
changed from mount to mount but cannot be changed on
|
||||
remount). Values of 1 to 4 are allowed, 1 being the
|
||||
default. The MFT zone multiplier determines how much
|
||||
space is reserved for the MFT on the volume. If all
|
||||
other space is used up, then the MFT zone will be
|
||||
shrunk dynamically, so this has no impact on the
|
||||
amount of free space. However, it can have an impact
|
||||
on performance by affecting fragmentation of the MFT.
|
||||
In general use the default. If you have a lot of small
|
||||
files then use a higher value. The values have the
|
||||
following meaning:
|
||||
Value MFT zone size (% of volume size)
|
||||
1 12.5%
|
||||
2 25%
|
||||
3 37.5%
|
||||
4 50%
|
||||
Note this option is irrelevant for read-only mounts.
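
The mft_zone_multiplier table above amounts to reserving one eighth of the
volume per step. A minimal sketch of that arithmetic follows; the function
name mft_zone_bytes is hypothetical and the calculation only mirrors the
percentages in the table, not the driver's internal code.

```python
def mft_zone_bytes(volume_bytes, multiplier=1):
    """Return the size of the MFT zone for a given multiplier (1 to 4)."""
    if multiplier not in (1, 2, 3, 4):
        raise ValueError("mft_zone_multiplier must be 1, 2, 3 or 4")
    # Each step of the multiplier reserves another 12.5% (one eighth).
    return volume_bytes * multiplier // 8


GIB = 2 ** 30
for multiplier in (1, 2, 3, 4):
    zone = mft_zone_bytes(80 * GIB, multiplier)
    print(f"multiplier {multiplier}: {zone / GIB:.1f} GiB of an 80 GiB volume")
```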
|
||||
|
||||
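
The default modes quoted under uid=/gid=/umask= above follow from clearing the
mask bits out of rwxrwxrwx. The sketch below assumes exactly that rule
(mode = 0777 & ~mask); the helper name mode_string is hypothetical, and the
only library call used is stat.filemode() from the Python standard library.

```python
import stat


def mode_string(mask, is_dir):
    """Return the permission string left after clearing the bits in mask."""
    permissions = 0o777 & ~mask
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    # stat.filemode() renders e.g. "-rw-------"; drop the leading type char.
    return stat.filemode(file_type | permissions)[1:]


print(mode_string(0o177, is_dir=False))  # default fmask=0177
print(mode_string(0o077, is_dir=True))   # default dmask=0077
print(mode_string(0o000, is_dir=False))  # umask=0
```

With the default masks this gives rw------- for files and rwx------ for
directories, and a mask of zero gives rwxrwxrwx, matching the text.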
|
||||
Known bugs and (mis-)features
|
||||
=============================
|
||||
|
||||
- The link count on each directory inode entry is set to 1, due to Linux not
|
||||
supporting directory hard links. This may well confuse some user space
|
||||
applications, since the directory names will have the same inode numbers.
|
||||
This also speeds up ntfs_read_inode() immensely. And we haven't found any
|
||||
problems with this approach so far. If you find a problem with this, please
|
||||
let us know.
|
||||
|
||||
|
||||
Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
|
||||
list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
|
||||
|
||||
|
||||
Using NTFS volume and stripe sets
|
||||
=================================
|
||||
|
||||
For support of volume and stripe sets, you can either use the kernel's
|
||||
Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
|
||||
the recommended one to use for linear raid. But the latter is required for
|
||||
raid level 5. For striping and mirroring, either driver should work fine.
|
||||
|
||||
|
||||
The Device-Mapper driver
|
||||
------------------------
|
||||
|
||||
You will need to create a table of the components of the volume/stripe set and
|
||||
how they fit together and load this into the kernel using the dmsetup utility
|
||||
(see man 8 dmsetup).
|
||||
|
||||
Linear volume sets, i.e. linear raid, have been tested and work fine. Even
|
||||
though untested, there is no reason why stripe sets, i.e. raid level 0, and
|
||||
mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
|
||||
raid level 5, unfortunately cannot work yet because the current version of the
|
||||
Device-Mapper driver does not support raid level 5. You may be able to use the
|
||||
Software RAID / MD driver for raid level 5, see the next section for details.
|
||||
|
||||
To create the table describing your volume you will need to know each of its
|
||||
components and their sizes in sectors, i.e. multiples of 512-byte blocks.
|
||||
|
||||
For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
|
||||
example if one of your partitions is /dev/hda2 you would do:
|
||||
|
||||
$ fdisk -ul /dev/hda
|
||||
|
||||
Disk /dev/hda: 81.9 GB, 81964302336 bytes
|
||||
255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
|
||||
Units = sectors of 1 * 512 = 512 bytes
|
||||
|
||||
Device Boot Start End Blocks Id System
|
||||
/dev/hda1 * 63 4209029 2104483+ 83 Linux
|
||||
/dev/hda2 4209030 37768814 16779892+ 86 NTFS
|
||||
/dev/hda3 37768815 46170809 4200997+ 83 Linux
|
||||
|
||||
And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
|
||||
33559785 sectors.
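
The same calculation can be written down once and applied to every partition
in the fdisk listing. This is a small sketch; partition_sectors is a
hypothetical helper name and the start/end values are the ones from the
example output above.

```python
def partition_sectors(start, end):
    """Size in sectors of a partition whose first and last sectors are given."""
    # Both the start and the end sector belong to the partition, hence + 1.
    return end - start + 1


listing = {
    "/dev/hda1": (63, 4209029),
    "/dev/hda2": (4209030, 37768814),
    "/dev/hda3": (37768815, 46170809),
}
for device, (start, end) in listing.items():
    print(device, partition_sectors(start, end))
```

The "+" after the Blocks value in the fdisk output marks an odd sector count,
which is why doubling the 1024-byte block count falls one sector short.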
|
||||
|
||||
For Win2k and later dynamic disks, you can for example use the ldminfo utility
|
||||
which is part of the Linux LDM tools (the latest version at the time of
|
||||
writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
|
||||
http://linux-ntfs.sourceforge.net/downloads.html
|
||||
Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
|
||||
into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
|
||||
will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
|
||||
able to compile this yourself easily so use the binary version!
|
||||
|
||||
Then you would use ldminfo in dump mode to obtain the necessary information:
|
||||
|
||||
$ ./ldminfo --dump /dev/hda
|
||||
|
||||
This would dump the LDM database found on /dev/hda which describes all of your
|
||||
dynamic disks and all the volumes on them. At the bottom you will see the
|
||||
VOLUME DEFINITIONS section which is all you really need. You may need to look
|
||||
further above to determine which of the disks in the volume definitions is
|
||||
which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
|
||||
look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
|
||||
section). You can then find these Disk Ids in the VBLK DATABASE section in the
|
||||
<Disk> components where you will get the LDM Name for the disk that is found in
|
||||
the VOLUME DEFINITIONS section.
|
||||
|
||||
Note you will also need to enable the LDM driver in the Linux kernel. If your
|
||||
distribution did not enable it, you will need to recompile the kernel with it
|
||||
enabled. This will create the LDM partitions on each device at boot time. You
|
||||
would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
|
||||
in the Device-Mapper table.
|
||||
|
||||
You can also bypass using the LDM driver by using the main device (e.g.
|
||||
/dev/hda) and then using the offsets of the LDM partitions into this device as
|
||||
the "Start sector of device" when creating the table. Once again ldminfo would
|
||||
give you the correct information to do this.
|
||||
|
||||
Assuming you know all your devices and their sizes things are easy.
|
||||
|
||||
For a linear raid the table would look like this (note all values are in
|
||||
512-byte sectors):
|
||||
|
||||
--- cut here ---
|
||||
# Offset into Size of this Raid type Device Start sector
|
||||
# volume device of device
|
||||
0 1028161 linear /dev/hda1 0
|
||||
1028161 3903762 linear /dev/hdb2 0
|
||||
4931923 2103211 linear /dev/hdc1 0
|
||||
--- cut here ---
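
The "Offset into volume" column is simply the running sum of the sizes of the
preceding components. A minimal sketch that generates the table above from a
list of components follows; linear_table is a hypothetical helper, not part of
dmsetup, and the output still has to be saved to a file and loaded with
dmsetup as described further down.

```python
def linear_table(components):
    """Build a linear table from (device, size_in_sectors, start_sector)."""
    lines = []
    offset = 0  # offset of the current component into the volume
    for device, size, start in components:
        lines.append(f"{offset} {size} linear {device} {start}")
        offset += size  # the next component begins where this one ends
    return "\n".join(lines)


print(linear_table([
    ("/dev/hda1", 1028161, 0),
    ("/dev/hdb2", 3903762, 0),
    ("/dev/hdc1", 2103211, 0),
]))
```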
|
||||
|
||||
For a striped volume, i.e. raid level 0, you will need to know the chunk size
|
||||
you used when creating the volume. Windows uses 64kiB as the default, so it
|
||||
will probably be this unless you changed the defaults when creating the array.
|
||||
|
||||
For a raid level 0 the table would look like this (note all values are in
|
||||
512-byte sectors):
|
||||
|
||||
--- cut here ---
|
||||
# Offset Size Raid Number Chunk 1st Start 2nd Start
|
||||
# into of the type of size Device in Device in
|
||||
# volume volume stripes device device
|
||||
0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
|
||||
--- cut here ---
|
||||
|
||||
If there are more than two devices, just add each of them to the end of the
|
||||
line.
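
The chunk size in the table is given in 512-byte sectors, so the Windows
default of 64kiB becomes 128. The sketch below builds the striped line for any
number of devices; striped_table is a hypothetical helper name and the field
order is copied from the example above.

```python
SECTOR_BYTES = 512


def striped_table(volume_sectors, chunk_bytes, devices):
    """Build a raid level 0 table line from (device, start_sector) pairs."""
    chunk_sectors = chunk_bytes // SECTOR_BYTES  # 64kiB -> 128 sectors
    fields = [0, volume_sectors, "striped", len(devices), chunk_sectors]
    for device, start in devices:
        fields += [device, start]
    return " ".join(str(field) for field in fields)


print(striped_table(2056320, 64 * 1024, [("/dev/hda1", 0), ("/dev/hdb1", 0)]))
```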
|
||||
|
||||
Finally, for a mirrored volume, i.e. raid level 1, the table would look like
|
||||
this (note all values are in 512-byte sectors):
|
||||
|
||||
--- cut here ---
|
||||
# Ofs Size Raid Log Number Region Should Number Source Start Target Start
|
||||
# in of the type type of log size sync? of Device in Device in
|
||||
# vol volume params mirrors Device Device
|
||||
0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
|
||||
--- cut here ---
|
||||
|
||||
If you are mirroring to multiple devices you can specify further targets at the
|
||||
end of the line.
|
||||
|
||||
Note the "Should sync?" parameter "nosync" means that the two mirrors are
|
||||
already in sync which will be the case on a clean shutdown of Windows. If the
|
||||
mirrors are not clean, you can specify the "sync" option instead of "nosync"
|
||||
and the Device-Mapper driver will then copy the entirety of the "Source Device"
|
||||
to the "Target Device" or if you specified multiple target devices to all of
|
||||
them.
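
The mirror line can be generated in the same way as the others. This sketch
follows the field order of the example above, with the "core" log type and a
region size of 16 taken from that example as defaults; mirror_table is a
hypothetical helper name, and the first device in the list is treated as the
source device.

```python
def mirror_table(volume_sectors, devices, in_sync=True, region_sectors=16):
    """Build a raid level 1 table line from (device, start_sector) pairs."""
    # "nosync" states that the mirrors already match, "sync" requests a copy.
    sync_flag = "nosync" if in_sync else "sync"
    # "core 2" means the core log type followed by two log parameters.
    fields = [0, volume_sectors, "mirror", "core", 2, region_sectors,
              sync_flag, len(devices)]
    for device, start in devices:
        fields += [device, start]
    return " ".join(str(field) for field in fields)


print(mirror_table(2056320, [("/dev/hda1", 0), ("/dev/hdb1", 0)]))
```

Passing in_sync=False produces the "sync" variant described above for mirrors
that were not shut down cleanly.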
|
||||
|
||||
Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
|
||||
and hand it over to dmsetup to work with, like so:
|
||||
|
||||
$ dmsetup create myvolume1 /etc/ntfsvolume1
|
||||
|
||||
You can obviously replace "myvolume1" with whatever name you like.
|
||||
|
||||
If it all worked, you will now have the device /dev/device-mapper/myvolume1
|
||||
which you can then just use as an argument to the mount command as usual to
|
||||
mount the ntfs volume. For example:
|
||||
|
||||
$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
|
||||
|
||||
(You need to create the directory /mnt/myvol1 first and of course you can use
|
||||
anything you like instead of /mnt/myvol1 as long as it is an existing
|
||||
directory.)
|
||||
|
||||
It is advisable to do the mount read-only to see if the volume has been set up
|
||||
correctly to avoid the possibility of causing damage to the data on the ntfs
|
||||
volume.
|
||||
|
||||
|
||||
The Software RAID / MD driver
|
||||
-----------------------------
|
||||
|
||||
An alternative to using the Device-Mapper driver is to use the kernel's
|
||||
Software RAID / MD driver, for which you need to set up your /etc/raidtab
|
||||
appropriately (see man 5 raidtab).
|
||||
|
||||
Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
|
||||
0, have been tested and work fine (though see section "Limitations when using
|
||||
the MD driver with NTFS volumes" especially if you want to use linear raid).
|
||||
Even though untested, there is no reason why mirrors, i.e. raid level 1, and
|
||||
stripes with parity, i.e. raid level 5, should not work, too.
|
||||
|
||||
You have to use the "persistent-superblock 0" option for each raid-disk in the
|
||||
NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
|
||||
superblock used by the MD driver would damage the NTFS volume.
|
||||
|
||||
Windows by default uses a stripe chunk size of 64k, so you probably want the
|
||||
"chunk-size 64k" option for each raid-disk, too.
|
||||
|
||||
For example, if you have a stripe set consisting of two partitions /dev/hda5
|
||||
and /dev/hdb1 your /etc/raidtab would look like this:
|
||||
|
||||
raiddev /dev/md0
|
||||
raid-level 0
|
||||
nr-raid-disks 2
|
||||
nr-spare-disks 0
|
||||
persistent-superblock 0
|
||||
chunk-size 64k
|
||||
device /dev/hda5
|
||||
raid-disk 0
|
||||
device /dev/hdb1
|
||||
raid-disk 1
|
||||
|
||||
For linear raid, just change the raid-level above to "raid-level linear", for
|
||||
mirrors, change it to "raid-level 1", and for stripe sets with parity, change
|
||||
it to "raid-level 5".
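
Because only the raid-level line and the device list change between these
cases, the raidtab entry can be produced from a small template. The sketch
below is an assumption-laden illustration: raidtab_entry is a hypothetical
helper, the keywords and the "64k" chunk size are copied from the example
above, and the parity-algorithm line needed for raid level 5 is left out.

```python
def raidtab_entry(md_device, raid_level, devices, chunk_size="64k"):
    """Return an /etc/raidtab entry for an NTFS volume or stripe set."""
    lines = [
        f"raiddev {md_device}",
        f"raid-level {raid_level}",
        f"nr-raid-disks {len(devices)}",
        "nr-spare-disks 0",
        # A persistent superblock would overwrite NTFS data, so keep it off.
        "persistent-superblock 0",
        f"chunk-size {chunk_size}",
    ]
    for index, device in enumerate(devices):
        lines.append(f"device {device}")
        lines.append(f"raid-disk {index}")
    return "\n".join(lines) + "\n"


print(raidtab_entry("/dev/md0", "linear", ["/dev/hda5", "/dev/hdb1"]), end="")
```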
|
||||
|
||||
Note for stripe sets with parity you will also need to tell the MD driver
|
||||
which parity algorithm to use by specifying the option "parity-algorithm
|
||||
which", where you need to replace "which" with the name of the algorithm to
|
||||
use (see man 5 raidtab for available algorithms) and you will have to try the
|
||||
different available algorithms until you find one that works. Make sure you
|
||||
are working read-only when playing with this as you may damage your data
|
||||
otherwise. If you find which algorithm works please let us know (email the
|
||||
linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
|
||||
IRC in channel #ntfs on the irc.freenode.net network) so we can update this
|
||||
documentation.
|
||||
|
||||
Once the raidtab is set up, run for example raid0run -a to start all devices or
|
||||
raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
|
||||
|
||||
Then just use the mount command as usual to mount the ntfs volume using for
|
||||
example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
|
||||
|
||||
It is advisable to do the mount read-only to see if the md volume has been
|
||||
set up correctly to avoid the possibility of causing damage to the data on the
|
||||
ntfs volume.
|
||||
|
||||
|
||||
Limitations when using the Software RAID / MD driver
|
||||
----------------------------------------------------
|
||||
|
||||
Using the md driver will not work properly if any of your NTFS partitions have
|
||||
an odd number of sectors. This is especially important for linear raid as all
|
||||
data after the first partition with an odd number of sectors will be offset by
|
||||
one or more sectors so if you mount such a partition with write support you
|
||||
will cause massive damage to the data on the volume which will only become
|
||||
apparent when you try to use the volume again under Windows.
|
||||
|
||||
So when using linear raid, make sure that all your partitions have an even
|
||||
number of sectors BEFORE attempting to use it. You have been warned!
|
||||
|
||||
Even better is to simply use the Device-Mapper for linear raid and then you do
|
||||
not have this problem with odd numbers of sectors.
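
Before setting up linear raid with the MD driver, the sector counts can be
checked for odd values. This is a minimal sketch; odd_sized is a hypothetical
helper, and the sizes are the ones from the linear Device-Mapper example
earlier in this document, two of which turn out to be odd.

```python
def odd_sized(partitions):
    """Return the names of partitions with an odd number of sectors."""
    return [name for name, sectors in partitions if sectors % 2 == 1]


components = [
    ("/dev/hda1", 1028161),
    ("/dev/hdb2", 3903762),
    ("/dev/hdc1", 2103211),
]
problems = odd_sized(components)
if problems:
    print("odd sector count, do not use the MD driver:", ", ".join(problems))
else:
    print("all partitions have an even number of sectors")
```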
|
||||
|
||||
|
||||
ChangeLog
|
||||
=========
|
||||
|
||||
Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
|
||||
|
||||
2.1.22:
|
||||
- Improve handling of ntfs volumes with errors.
|
||||
- Fix various bugs and race conditions.
|
||||
2.1.21:
|
||||
- Fix several race conditions and various other bugs.
|
||||
- Many internal cleanups, code reorganization, optimizations, and mft
|
||||
and index record writing code rewritten to fit in with the changes.
|
||||
- Update Documentation/filesystems/ntfs.txt with instructions on how to
|
||||
use the Device-Mapper driver with NTFS ftdisk/LDM raid.
|
||||
2.1.20:
|
||||
- Fix two stupid bugs introduced in 2.1.18 release.
|
||||
2.1.19:
|
||||
- Minor bugfix in handling of the default upcase table.
|
||||
- Many internal cleanups and improvements. Many thanks to Linus
|
||||
Torvalds and Al Viro for the help and advice with the sparse
|
||||
annotations and cleanups.
|
||||
2.1.18:
|
||||
- Fix scheduling latencies at mount time. (Ingo Molnar)
|
||||
- Fix endianness bug in a little traversed portion of the attribute
|
||||
lookup code.
|
||||
2.1.17:
|
||||
- Fix bugs in mount time error code paths.
|
||||
2.1.16:
|
||||
- Implement access time updates (including mtime and ctime).
|
||||
- Implement fsync(2), fdatasync(2), and msync(2) system calls.
|
||||
- Enable the readv(2) and writev(2) system calls.
|
||||
- Enable access via the asynchronous io (aio) API by adding support for
|
||||
the aio_read(3) and aio_write(3) functions.
|
||||
2.1.15:
|
||||
- Invalidate quotas when (re)mounting read-write.
|
||||
NOTE: This now only leaves user space journalling on the side. (See
|
||||
note for version 2.1.13, below.)
|
||||
2.1.14:
|
||||
- Fix an NFSd caused deadlock reported by several users.
|
||||
2.1.13:
|
||||
- Implement writing of inodes (access time updates are not implemented
|
||||
yet so mounting with -o noatime,nodiratime is enforced).
|
||||
- Enable writing out of resident files so you can now overwrite any
|
||||
uncompressed, unencrypted, nonsparse file as long as you do not
|
||||
change the file size.
|
||||
- Add housekeeping of ntfs system files so that ntfsfix no longer needs
|
||||
to be run after writing to an NTFS volume.
|
||||
NOTE: This still leaves quota tracking and user space journalling on
|
||||
the side but they should not cause data corruption. In the worst
|
||||
case the charged quotas will be out of date ($Quota) and some
|
||||
userspace applications might get confused due to the out of date
|
||||
userspace journal ($UsnJrnl).
|
||||
2.1.12:
|
||||
- Fix the second fix to the decompression engine from the 2.1.9 release
|
||||
and some further internals cleanups.
|
||||
2.1.11:
|
||||
- Driver internal cleanups.
|
||||
2.1.10:
|
||||
- Force read-only (re)mounting of volumes with unsupported volume
|
||||
flags and various cleanups.
|
||||
2.1.9:
|
||||
- Fix two bugs in handling of corner cases in the decompression engine.
|
||||
2.1.8:
|
||||
- Read the $MFT mirror and compare it to the $MFT and if the two do not
|
||||
match, force a read-only mount and do not allow read-write remounts.
|
||||
- Read and parse the $LogFile journal and if it indicates that the
|
||||
volume was not shutdown cleanly, force a read-only mount and do not
|
||||
allow read-write remounts. If the $LogFile indicates a clean
|
||||
shutdown and a read-write (re)mount is requested, empty $LogFile to
|
||||
ensure that Windows cannot cause data corruption by replaying a stale
|
||||
journal after Linux has written to the volume.
|
||||
- Improve time handling so that the NTFS time is fully preserved when
|
||||
converted to kernel time and only up to 99 nano-seconds are lost when
|
||||
kernel time is converted to NTFS time.
|
||||
2.1.7:
|
||||
- Enable NFS exporting of mounted NTFS volumes.
|
||||
2.1.6:
|
||||
- Fix minor bug in handling of compressed directories that fixes the
|
||||
erroneous "du" and "stat" output people reported.
|
||||
2.1.5:
|
||||
- Minor bug fix in attribute list attribute handling that fixes the
|
||||
I/O errors on "ls" of certain fragmented files found by at least two
|
||||
people running Windows XP.
|
||||
2.1.4:
|
||||
- Minor update allowing compilation with all gcc versions (well, the
|
||||
ones the kernel can be compiled with anyway).
|
||||
2.1.3:
|
||||
- Major bug fixes for reading files and volumes in corner cases which
|
||||
were being hit by Windows 2k/XP users.
|
||||
2.1.2:
|
||||
- Major bug fixes alleviating the hangs in statfs experienced by some
|
||||
users.
|
||||
2.1.1:
|
||||
- Update handling of compressed files so people no longer get the
|
||||
frequently reported warning messages about initialized_size !=
|
||||
data_size.
|
||||
2.1.0:
|
||||
- Add configuration option for developmental write support.
|
||||
- Initial implementation of file overwriting. (Writes to resident files
|
||||
are not written out to disk yet, so avoid writing to files smaller
|
||||
than about 1kiB.)
|
||||
- Intercept/abort changes in file size as they are not implemented yet.
|
||||
2.0.25:
|
||||
- Minor bugfixes in error code paths and small cleanups.
|
||||
2.0.24:
|
||||
- Small internal cleanups.
|
||||
- Support for sendfile system call. (Christoph Hellwig)
|
||||
2.0.23:
|
||||
- Massive internal locking changes to mft record locking. Fixes
|
||||
various race conditions and deadlocks.
|
||||
- Fix ntfs over loopback for compressed files by adding an
|
||||
optimization barrier. (gcc was screwing up otherwise ?)
|
||||
Thanks go to Christoph Hellwig for pointing these two out:
|
||||
- Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs().
|
||||
- Fix ntfs_free() for ia64 and parisc.
|
||||
2.0.22:
|
||||
- Small internal cleanups.
|
||||
2.0.21:
|
||||
These only affect 32-bit architectures:
|
||||
- Check for, and refuse to mount too large volumes (maximum is 2TiB).
|
||||
- Check for, and refuse to open too large files and directories
|
||||
(maximum is 16TiB).
|
||||
2.0.20:
|
||||
- Support non-resident directory index bitmaps. This means we now cope
|
||||
with huge directories without problems.
|
||||
- Fix a page leak that manifested itself in some cases when reading
|
||||
directory contents.
|
||||
- Internal cleanups.
|
||||
2.0.19:
|
||||
- Fix race condition and improvements in block i/o interface.
|
||||
- Optimization when reading compressed files.
|
||||
2.0.18:
|
||||
- Fix race condition in reading of compressed files.
|
||||
2.0.17:
|
||||
- Cleanups and optimizations.
|
||||
2.0.16:
|
||||
- Fix stupid bug introduced in 2.0.15 in new attribute inode API.
|
||||
- Big internal cleanup replacing the mftbmp access hacks by using the
|
||||
new attribute inode API instead.
|
||||
2.0.15:
|
||||
- Bug fix in parsing of remount options.
|
||||
- Internal changes implementing attribute (fake) inodes allowing all
|
||||
attribute i/o to go via the page cache and to use all the normal
|
||||
vfs/mm functionality.
|
||||
2.0.14:
|
||||
- Internal changes improving run list merging code and minor locking
|
||||
change to not rely on BKL in ntfs_statfs().
|
||||
2.0.13:
|
||||
- Internal changes towards using iget5_locked() in preparation for
|
||||
fake inodes and small cleanups to ntfs_volume structure.
|
||||
2.0.12:
|
||||
- Internal cleanups in address space operations made possible by the
|
||||
changes introduced in the previous release.
|
||||
2.0.11:
|
||||
- Internal updates and cleanups introducing the first step towards
|
||||
fake inode based attribute i/o.
|
||||
2.0.10:
|
||||
- Microsoft says that the maximum number of inodes is 2^32 - 1. Update
|
||||
the driver accordingly to only use 32-bits to store inode numbers on
|
||||
32-bit architectures. This improves the speed of the driver a little.
|
||||
2.0.9:
|
||||
- Change decompression engine to use a single buffer. This should not
|
||||
affect performance except perhaps on the most heavy i/o on SMP
|
||||
systems when accessing multiple compressed files from multiple
|
||||
devices simultaneously.
|
||||
- Minor updates and cleanups.
|
||||
2.0.8:
|
||||
- Remove now obsolete show_inodes and posix mount option(s).
|
||||
- Restore show_sys_files mount option.
|
||||
- Add new mount option case_sensitive, to determine if the driver
|
||||
treats file names as case sensitive or not.
|
||||
- Mostly drop support for short file names (for backwards compatibility
|
||||
we only support accessing files via their short file name if one
|
||||
exists).
|
||||
- Fix dcache aliasing issues wrt short/long file names.
|
||||
- Cleanups and minor fixes.
|
||||
2.0.7:
|
||||
- Just cleanups.
|
||||
2.0.6:
|
||||
- Major bugfix to make compatible with other kernel changes. This fixes
|
||||
the hangs/oopses on umount.
|
||||
- Locking cleanup in directory operations (remove BKL usage).
|
||||
2.0.5:
|
||||
- Major buffer overflow bug fix.
|
||||
- Minor cleanups and updates for kernel 2.5.12.
|
||||
2.0.4:
|
||||
- Cleanups and updates for kernel 2.5.11.
|
||||
2.0.3:
|
||||
- Small bug fixes, cleanups, and performance improvements.
|
||||
2.0.2:
|
||||
- Use default fmask of 0177 so that files are not executable by default.
|
||||
If you want owner executable files, just use fmask=0077.
|
||||
- Update for kernel 2.5.9 but preserve backwards compatibility with
|
||||
kernel 2.5.7.
|
||||
- Minor bug fixes, cleanups, and updates.
|
||||
2.0.1:
|
||||
- Minor updates, primarily set the executable bit by default on files
|
||||
so they can be executed.
|
||||
2.0.0:
|
||||
- Started ChangeLog.
|
||||
|
266
Documentation/filesystems/porting
Normal file
@@ -0,0 +1,266 @@
|
||||
Changes since 2.5.0:
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
|
||||
sb_set_blocksize() and sb_min_blocksize().
|
||||
|
||||
Use them.
|
||||
|
||||
(sb_find_get_block() replaces 2.4's get_hash_table())
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
New methods: ->alloc_inode() and ->destroy_inode().
|
||||
|
||||
Remove inode->u.foo_inode_i
|
||||
Declare
|
||||
	struct foo_inode_info {
		/* fs-private stuff */
		struct inode vfs_inode;
	};
	static inline struct foo_inode_info *FOO_I(struct inode *inode)
	{
		return list_entry(inode, struct foo_inode_info, vfs_inode);
	}
|
||||
|
||||
Use FOO_I(inode) instead of &inode->u.foo_inode_i;
|
||||
|
||||
Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
|
||||
foo_inode_info and return the address of ->vfs_inode, the latter should free
|
||||
FOO_I(inode) (see in-tree filesystems for examples).
|
||||
|
||||
Make them ->alloc_inode and ->destroy_inode in your super_operations.
|
||||
|
||||
Keep in mind that now you need explicit initialization of private data -
|
||||
typically in ->read_inode() and after getting an inode from new_inode().
|
||||
|
||||
At some point that will become mandatory.
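The embedding pattern above can be sketched in plain userspace C. This is only an illustration: struct inode here is a one-field stand-in for the kernel type, private_flags is a hypothetical fs-private field, and calloc()/free() stand in for whatever allocator a real filesystem would use (typically a slab cache). FOO_I() is written with offsetof() so the sketch compiles outside the kernel; it computes the same thing as the list_entry() form.

```c
#include <stddef.h>
#include <stdlib.h>

struct super_block;	/* opaque here; only passed through */

/* Stand-in for the kernel's struct inode, just so the sketch compiles. */
struct inode {
	unsigned long i_ino;
};

/* Filesystem-private inode with the VFS inode embedded in it. */
struct foo_inode_info {
	unsigned long private_flags;	/* hypothetical fs-private field */
	struct inode vfs_inode;
};

/* Recover the enclosing foo_inode_info from a pointer to its vfs_inode. */
static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
	return (struct foo_inode_info *)
		((char *)inode - offsetof(struct foo_inode_info, vfs_inode));
}

/* Would become ->alloc_inode: allocate the container, return the VFS part. */
static struct inode *foo_alloc_inode(struct super_block *sb)
{
	struct foo_inode_info *fi = calloc(1, sizeof(*fi));

	(void)sb;
	return fi ? &fi->vfs_inode : NULL;
}

/* Would become ->destroy_inode: free the whole container, not just the inode. */
static void foo_destroy_inode(struct inode *inode)
{
	free(FOO_I(inode));
}
```

Because the private data lives in the same allocation as the inode, nothing initializes it implicitly; that is why explicit initialization after allocation is needed.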
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
Change of file_system_type method (->read_super to ->get_sb)
|
||||
|
||||
->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
|
||||
|
||||
Turn your foo_read_super() into a function that would return 0 in case of
|
||||
success and negative number in case of error (-EINVAL unless you have more
|
||||
informative error value to report). Call it foo_fill_super(). Now declare
|
||||
|
||||
	struct super_block *foo_get_sb(struct file_system_type *fs_type,
		int flags, const char *dev_name, void *data)
	{
		return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super);
	}
|
||||
|
||||
(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
|
||||
filesystem).
|
||||
|
||||
Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
|
||||
foo_get_sb.
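A rough userspace mock of the resulting shape, for orientation only. Everything below the "stand-ins" comment is a simplified substitute for the kernel's types and helpers so that the sketch compiles on its own; in particular the mocked get_sb_bdev() skips opening the block device and returns NULL on failure, which is an assumption of this sketch and not how the kernel helper reports errors. The s_filled field is hypothetical.

```c
#include <stddef.h>

/* Stand-ins for kernel types and helpers, only to keep this self-contained. */
struct super_block { int s_filled; };
struct file_system_type {
	const char *name;
	int fs_flags;
	struct super_block *(*get_sb)(struct file_system_type *, int,
				      const char *, void *);
	void (*kill_sb)(struct super_block *);
};
#define FS_REQUIRES_DEV 1

static struct super_block mock_sb;

/* Mock: call the fill callback on a static superblock. */
static struct super_block *get_sb_bdev(struct file_system_type *fs_type,
		int flags, const char *dev_name, void *data,
		int (*fill)(struct super_block *, void *, int))
{
	(void)fs_type; (void)flags; (void)dev_name;
	return fill(&mock_sb, data, 0) ? NULL : &mock_sb;
}

static void kill_block_super(struct super_block *sb)
{
	sb->s_filled = 0;
}

/* The former foo_read_super(): 0 on success, negative value on error. */
static int foo_fill_super(struct super_block *sb, void *data, int silent)
{
	(void)data; (void)silent;
	sb->s_filled = 1;
	return 0;
}

static struct super_block *foo_get_sb(struct file_system_type *fs_type,
		int flags, const char *dev_name, void *data)
{
	return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super);
}

/* Explicit initializer in place of DECLARE_FSTYPE_DEV. */
static struct file_system_type foo_fs_type = {
	.name     = "foo",
	.fs_flags = FS_REQUIRES_DEV,
	.get_sb   = foo_get_sb,
	.kill_sb  = kill_block_super,
};
```

See in-tree filesystems for the real initializers.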
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
|
||||
Most likely there is no need to change anything, but if you relied on
|
||||
global exclusion between renames for some internal purpose - you need to
|
||||
change your internal locking. Otherwise exclusion warranties remain the
|
||||
same (i.e. parents and victim are locked, etc.).
|
||||
|
||||
---
|
||||
[informational]
|
||||
|
||||
Now we have the exclusion between ->lookup() and directory removal (by
|
||||
->rmdir() and ->rename()). If you used to need that exclusion and do
|
||||
it by internal locking (most filesystems couldn't care less) - you
|
||||
can relax your locking.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
|
||||
->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
|
||||
and ->readdir() are called without BKL now. Grab it on entry, drop upon return
|
||||
- that will guarantee the same locking you used to have. If your method or its
|
||||
parts do not need BKL - better yet, now you can shift lock_kernel() and
|
||||
unlock_kernel() so that they would protect exactly what needs to be
|
||||
protected.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
BKL is also moved from around sb operations. ->write_super() is now called
|
||||
without BKL held. BKL should have been shifted into individual fs sb_op
|
||||
functions. If you don't need it, remove it.
|
||||
|
||||
---
|
||||
[informational]
|
||||
|
||||
check for ->link() target not being a directory is done by callers. Feel
|
||||
free to drop it...
|
||||
|
||||
---
|
||||
[informational]
|
||||
|
||||
->link() callers hold ->i_sem on the object we are linking to. Some of your
|
||||
problems might be over...
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
new file_system_type method - kill_sb(superblock). If you are converting
|
||||
an existing filesystem, set it according to ->fs_flags:
|
||||
FS_REQUIRES_DEV - kill_block_super
|
||||
FS_LITTER - kill_litter_super
|
||||
neither - kill_anon_super
|
||||
FS_LITTER is gone - just remove it from fs_flags.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
FS_SINGLE is gone (actually, that had happened back when ->get_sb()
|
||||
went in - and hadn't been documented ;-/). Just remove it from fs_flags
|
||||
(and see ->get_sb() entry for other actions).
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so
|
||||
watch for ->i_sem-grabbing code that might be used by your ->setattr().
|
||||
Callers of notify_change() need ->i_sem now.
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
New super_block field "struct export_operations *s_export_op" for
|
||||
explicit support for exporting, e.g. via NFS. The structure is fully
|
||||
documented at its declaration in include/linux/fs.h, and in
|
||||
Documentation/filesystems/Exporting.
|
||||
|
||||
Briefly it allows for the definition of decode_fh and encode_fh operations
|
||||
to encode and decode filehandles, and allows the filesystem to use
|
||||
a standard helper function for decode_fh, and provide file-system specific
|
||||
support for this helper, particularly get_parent.
|
||||
|
||||
It is planned that this will be required for exporting once the code
|
||||
settles down a bit.
|
||||
|
||||
[mandatory]
|
||||
|
||||
s_export_op is now required for exporting a filesystem.
|
||||
isofs, ext2, ext3, reiserfs, fat
|
||||
can be used as examples of very different filesystems.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
iget4() and the read_inode2 callback have been superseded by iget5_locked()
|
||||
which has the following prototype,
|
||||
|
||||
    struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
				int (*test)(struct inode *, void *),
				int (*set)(struct inode *, void *),
				void *data);
|
||||
|
||||
'test' is an additional function that can be used when the inode
|
||||
number is not sufficient to identify the actual file object. 'set'
|
||||
should be a non-blocking function that initializes those parts of a
|
||||
newly created inode to allow the test function to succeed. 'data' is
|
||||
passed as an opaque value to both test and set functions.
|
||||
|
||||
When the inode has been created by iget5_locked(), it will be returned with
|
||||
the I_NEW flag set and will still be locked. read_inode has not been
|
||||
called so the file system still has to finalize the initialization. Once
|
||||
the inode is initialized it must be unlocked by calling unlock_new_inode().
|
||||
|
||||
The filesystem is responsible for setting (and possibly testing) i_ino
|
||||
when appropriate. There is also a simpler iget_locked function that
|
||||
just takes the superblock and inode number as arguments and does the
|
||||
test and set for you.
|
||||
|
||||
e.g.
|
||||
	inode = iget_locked(sb, ino);
	if (inode->i_state & I_NEW) {
		read_inode_from_disk(inode);
		unlock_new_inode(inode);
	}
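For the iget5_locked() case, a pair of 'test' and 'set' callbacks might look like the sketch below. It assumes a hypothetical filesystem where an object is identified by its inode number together with a generation count; struct foo_iget_key and both function names are invented for illustration, and struct inode is a two-field stand-in for the kernel type.

```c
/* Stand-in for the kernel's struct inode; only the fields used here. */
struct inode {
	unsigned long i_ino;
	unsigned int i_generation;
};

/* Hypothetical lookup key: inode number plus a generation count. */
struct foo_iget_key {
	unsigned long ino;
	unsigned int generation;
};

/* 'test': nonzero if this cached inode is the object described by data. */
static int foo_iget_test(struct inode *inode, void *data)
{
	struct foo_iget_key *key = data;

	return inode->i_ino == key->ino &&
	       inode->i_generation == key->generation;
}

/* 'set': must not block; fill in just enough for foo_iget_test() to match. */
static int foo_iget_set(struct inode *inode, void *data)
{
	struct foo_iget_key *key = data;

	inode->i_ino = key->ino;
	inode->i_generation = key->generation;
	return 0;
}
```

In a filesystem these would be passed as iget5_locked(sb, key.ino, foo_iget_test, foo_iget_set, &key), followed by the same I_NEW check and unlock_new_inode() call as in the iget_locked() example.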
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
->getattr() finally getting used. See instances in nfs, minix, etc.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->revalidate() is gone. If your filesystem had it - provide ->getattr()
|
||||
and let it call whatever you had as ->revalidate() + (for symlinks that
|
||||
had ->revalidate()) add calls in ->follow_link()/->readlink().
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->d_parent changes are not protected by BKL anymore. Read access is safe
|
||||
if at least one of the following is true:
|
||||
* filesystem has no cross-directory rename()
|
||||
* dcache_lock is held
|
||||
* we know that parent had been locked (e.g. we are looking at
|
||||
->d_parent of ->lookup() argument).
|
||||
* we are called from ->rename().
|
||||
* the child's ->d_lock is held
|
||||
Audit your code and add locking if needed. Notice that any place that is
|
||||
not protected by the conditions above is risky even in the old tree - you
|
||||
had been relying on BKL and that's prone to screwups. Old tree had quite
|
||||
a few holes of that kind - unprotected access to ->d_parent leading to
|
||||
anything from oops to silent memory corruption.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
|
||||
(see rootfs for one kind of solution and bdev/socket/pipe for another).
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
|
||||
is still alive, but only because of the mess in drivers/s390/block/dasd.c.
|
||||
As soon as it gets fixed is_read_only() will die.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->permission() is called without BKL now. Grab it on entry, drop upon
|
||||
return - that will guarantee the same locking you used to have. If
|
||||
your method or its parts do not need BKL - better yet, now you can
|
||||
shift lock_kernel() and unlock_kernel() so that they would protect
|
||||
exactly what needs to be protected.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->statfs() is now called without BKL held. BKL should have been
|
||||
shifted into individual fs sb_op functions where it's not clear that
|
||||
it's safe to remove it. If you don't need it, remove it.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
is_read_only() is gone; use bdev_read_only() instead.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
destroy_buffers() is gone; use invalidate_bdev().
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
|
||||
deliberate; as soon as struct block_device * is propagated in a reasonable
|
||||
way by that code fixing will become trivial; until then nothing can be
|
||||
done.
|
1940
Documentation/filesystems/proc.txt
Normal file
File diff suppressed because it is too large
187
Documentation/filesystems/romfs.txt
Normal file
@@ -0,0 +1,187 @@
|
||||
ROMFS - ROM FILE SYSTEM
|
||||
|
||||
This is a quite dumb, read only filesystem, mainly for initial RAM
|
||||
disks of installation disks. It has grown up by the need of having
|
||||
modules linked at boot time. Using this filesystem, you get a very
|
||||
similar feature, and even the possibility of a small kernel, with a
|
||||
file system which doesn't take up useful memory from the router
|
||||
functions in the basement of your office.
|
||||
|
||||
For comparison, both the older minix and xiafs (the latter is now
|
||||
defunct) filesystems, compiled as module need more than 20000 bytes,
|
||||
while romfs is less than a page, about 4000 bytes (assuming i586
|
||||
code). Under the same conditions, the msdos filesystem would need
|
||||
about 30K (and does not support device nodes or symlinks), while the
|
||||
nfs module with nfsroot is about 57K. Furthermore, as a bit unfair
|
||||
comparison, an actual rescue disk used up 3202 blocks with ext2, while
|
||||
with romfs, it needed 3079 blocks.
|
||||
|
||||
To create such a file system, you'll need a user program named
|
||||
genromfs. It is available via anonymous ftp on sunsite.unc.edu and
|
||||
its mirrors, in the /pub/Linux/system/recovery/ directory.
|
||||
|
||||
As the name suggests, romfs could be also used (space-efficiently) on
|
||||
various read-only media, like (E)EPROM disks if someone will have the
|
||||
motivation.. :)
|
||||
|
||||
However, the main purpose of romfs is to have a very small kernel,
|
||||
which has only this filesystem linked in, and then can load any module
|
||||
later, with the current module utilities. It can also be used to run
|
||||
some program to decide if you need SCSI devices, and even IDE or
|
||||
floppy drives can be loaded later if you use the "initrd"--initial
|
||||
RAM disk--feature of the kernel. This would not be really news
|
||||
flash, but with romfs, you can even spare off your ext2 or minix or
|
||||
maybe even affs filesystem until you really know that you need it.
|
||||
|
||||
For example, a distribution boot disk can contain only the cd disk
|
||||
drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
|
||||
module. The kernel can be small enough, since it doesn't have other
|
||||
filesystems, like the quite large ext2fs module, which can then be
|
||||
loaded off the CD at a later stage of the installation. Another use
|
||||
would be for a recovery disk, when you are reinstalling a workstation
|
||||
from the network, and you will have all the tools/modules available
|
||||
from a nearby server, so you don't want to carry two disks for this
|
||||
purpose, just because it won't fit into ext2.
|
||||
|
||||
romfs operates on block devices as you can expect, and the underlying
|
||||
structure is very simple. Every accessible structure begins on 16
|
||||
byte boundaries for fast access. The minimum space a file will take
|
||||
is 32 bytes (this is an empty file, with a less than 16 character
|
||||
name). The maximum overhead for any non-empty file is the header, and
|
||||
the 16 byte padding for the name and the contents, also 16+14+15 = 45
|
||||
bytes. This is quite rare however, since most file names are longer
|
||||
than 3 bytes, and shorter than 15 bytes.
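The space accounting can be written out as a small helper. This is a sketch based only on the rules stated above (16-byte header, name and contents each padded to a 16 byte boundary); the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of 16. */
static uint32_t romfs_pad16(uint32_t n)
{
	return (n + 15u) & ~15u;
}

/*
 * Bytes one file occupies: the 16-byte header, the name including its
 * terminating zero padded to 16 bytes, and the contents padded to 16 bytes.
 */
static uint32_t romfs_file_footprint(size_t name_len, uint32_t data_size)
{
	return 16u + romfs_pad16((uint32_t)name_len + 1u) +
	       romfs_pad16(data_size);
}
```

An empty file with a name shorter than 16 characters comes out at 32 bytes, matching the minimum quoted above; a 16-character name pushes the padded name to 32 bytes.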
|
||||
|
||||
The layout of the filesystem is the following:
|
||||
|
||||
offset      content

        +---+---+---+---+
  0     | - | r | o | m |  \
        +---+---+---+---+   The ASCII representation of those bytes
  4     | 1 | f | s | - |  /  (i.e. "-rom1fs-")
        +---+---+---+---+
  8     |   full size   |   The number of accessible bytes in this fs.
        +---+---+---+---+
 12     |    checksum   |   The checksum of the FIRST 512 BYTES.
        +---+---+---+---+
 16     |  volume name  |   The zero terminated name of the volume,
        :               :   padded to 16 byte boundary.
        +---+---+---+---+
 xx     |     file      |
        :    headers    :
|
||||
|
||||
Every multi byte value (32 bit words, I'll use the longwords term from
|
||||
now on) must be in big endian order.
|
||||
|
||||
The first eight bytes identify the filesystem, even for the casual
|
||||
inspector. After that, in the 3rd longword, it contains the number of
|
||||
bytes accessible from the start of this filesystem. The 4th longword
|
||||
is the checksum of the first 512 bytes (or the number of bytes
|
||||
accessible, whichever is smaller). The applied algorithm is the same
|
||||
as in the AFFS filesystem, namely a simple sum of the longwords
|
||||
(assuming bigendian quantities again). For details, please consult
|
||||
the source. This algorithm was chosen because although it's not quite
|
||||
reliable, it does not require any tables, and it is very simple.
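A minimal userspace sketch of the superblock check follows. One assumption goes beyond the text above: the stored checksum is taken to be chosen so that the big-endian longwords of the covered area sum to zero, which is my understanding of how the kernel source verifies it; please confirm against the source. The function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read one big-endian longword. */
static uint32_t romfs_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Sum the longwords of the first 512 bytes, or fewer if len is smaller. */
static uint32_t romfs_sum(const unsigned char *img, size_t len)
{
	uint32_t sum = 0;
	size_t n = len < 512 ? len : 512;
	size_t i;

	for (i = 0; i + 4 <= n; i += 4)
		sum += romfs_be32(img + i);
	return sum;
}

/* Returns 1 if img starts with a plausible romfs superblock, else 0. */
static int romfs_super_ok(const unsigned char *img, size_t len)
{
	uint32_t full_size;

	if (len < 16 || memcmp(img, "-rom1fs-", 8) != 0)
		return 0;
	full_size = romfs_be32(img + 8);
	if (full_size > len)		/* claimed size must fit the image */
		return 0;
	/* Assumption: a valid image sums to zero over the covered area. */
	return romfs_sum(img, full_size) == 0;
}
```

Under that assumption, an image builder would compute the sum with the checksum field zeroed and then store its two's complement negation at offset 12.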
|
||||
|
||||
The following bytes are now part of the file system; each file header
|
||||
must begin on a 16 byte boundary.
|
||||
|
||||
offset      content

        +---+---+---+---+
  0     | next filehdr|X|   The offset of the next file header
        +---+---+---+---+     (zero if no more files)
  4     |   spec.info   |   Info for directories/hard links/devices
        +---+---+---+---+
  8     |     size      |   The size of this file in bytes
        +---+---+---+---+
 12     |   checksum    |   Covering the meta data, including the file
        +---+---+---+---+     name, and padding
 16     |   file name   |   The zero terminated name of the file,
        :               :   padded to 16 byte boundary
        +---+---+---+---+
 xx     |   file data   |
        :               :
|
||||
|
||||
Since the file headers begin always at a 16 byte boundary, the lowest
|
||||
4 bits would be always zero in the next filehdr pointer. These four
|
||||
bits are used for the mode information. Bits 0..2 specify the type of
|
||||
the file; while bit 3 shows if the file is executable or not. The
|
||||
permissions are assumed to be world readable, if this bit is not set,
|
||||
and world executable if it is; except the character and block devices,
|
||||
they are never accessible for other than owner. The owner of every
|
||||
file is user and group 0, this should never be a problem for the
|
||||
intended use. The mapping of the 8 possible values to file types is
|
||||
the following:
|
||||
|
||||
          mapping               spec.info means
 0        hard link             link destination [file header]
 1        directory             first file's header
 2        regular file          unused, must be zero [MBZ]
 3        symbolic link         unused, MBZ (file data is the link content)
 4        block device          16/16 bits major/minor number
 5        char device                   - " -
 6        socket                unused, MBZ
 7        fifo                  unused, MBZ
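Splitting the first longword of a file header into its parts could look like the sketch below. The enum and struct names are invented for this example (the kernel uses its own constants), and the input is assumed to have been converted from big endian to host order already. The executable flag is taken to be the remaining one of the four low bits, i.e. the bit with value 8.

```c
#include <stdint.h>

/* File types in the order of the mapping table above. */
enum romfs_type {
	ROMFS_HARDLINK = 0, ROMFS_DIR, ROMFS_REG, ROMFS_SYMLINK,
	ROMFS_BLKDEV, ROMFS_CHRDEV, ROMFS_SOCKET, ROMFS_FIFO
};

struct romfs_next {
	uint32_t next_offset;	/* offset of the next file header, 0 if none */
	enum romfs_type type;	/* low three bits */
	int executable;		/* the bit with value 8 */
};

/* Decode the "next filehdr" longword (already in host byte order). */
static struct romfs_next romfs_decode_next(uint32_t word)
{
	struct romfs_next d;

	d.next_offset = word & ~(uint32_t)0xF;
	d.type = (enum romfs_type)(word & 7u);
	d.executable = (word & 8u) != 0;
	return d;
}
```

For instance, the value 0x12A decodes to a next header at offset 0x120, type 2 (regular file), with the executable flag set.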
|
||||
|
||||
Note that hard links are specifically marked in this filesystem, but
|
||||
they will behave as you can expect (i.e. share the inode number).
|
||||
Note also that it is your responsibility to not create hard link
|
||||
loops, and creating all the . and .. links for directories. This is
|
||||
normally done correctly by the genromfs program. Please refrain from
|
||||
using the executable bits for special purposes on the socket and fifo
|
||||
special files, they may have other uses in the future. Additionally,
|
||||
please remember that only regular files, and symlinks are supposed to
|
||||
have a nonzero size field; they contain the number of bytes available
|
||||
directly after the (padded) file name.
|
||||
|
||||
Another thing to note is that romfs works on file headers and data
|
||||
aligned to 16 byte boundaries, but most hardware devices and the block
|
||||
device drivers are unable to cope with smaller than block-sized data.
|
||||
To overcome this limitation, the whole size of the file system must be
|
||||
padded to a 1024 byte boundary.
|
||||
|
||||
If you have any problems or suggestions concerning this file system,
|
||||
please contact me. However, think twice before wanting me to add
|
||||
features and code, because the primary and most important advantage of
|
||||
this file system is the small code. On the other hand, don't be
|
||||
alarmed, I'm not getting that much romfs related mail. Now I can
|
||||
understand why Avery wrote poems in the ARCnet docs to get some more
|
||||
feedback. :)
|
||||
|
||||
romfs has also a mailing list, and to date, it hasn't received any
|
||||
traffic, so you are welcome to join it to discuss your ideas. :)
|
||||
|
||||
It's run by ezmlm, so you can subscribe to it by sending a message
|
||||
to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
|
||||
|
||||
Pending issues:
|
||||
|
||||
- Permissions and owner information are pretty essential features of a
|
||||
Un*x like system, but romfs does not provide the full possibilities.
|
||||
I have never found this limiting, but others might.
|
||||
|
||||
- The file system is read only, so it can be very small, but in case
|
||||
one would want to write _anything_ to a file system, he still needs
|
||||
a writable file system, thus negating the size advantages. Possible
|
||||
solutions: implement write access as a compile-time option, or a new,
|
||||
similarly small writable filesystem for RAM disks.
|
||||
|
||||
- Since the files are only required to have alignment on a 16 byte
|
||||
boundary, it is currently possibly suboptimal to read or execute files
|
||||
from the filesystem. It might be resolved by reordering file data to
|
||||
have most of it (i.e. except the start and the end) laying at "natural"
|
||||
boundaries, thus it would be possible to directly map a big portion of
|
||||
the file contents to the mm subsystem.
|
||||
|
||||
- Compression might be a useful feature, but memory is quite a
|
||||
limiting factor in my eyes.
|
||||
|
||||
- Where is it used?
|
||||
|
||||
- Does it work on other architectures than intel and motorola?
|
||||
|
||||
|
||||
Have fun,
|
||||
Janos Farkas <chexum@shadow.banki.hu>
|
8
Documentation/filesystems/smbfs.txt
Normal file
@@ -0,0 +1,8 @@
|
||||
Smbfs is a filesystem that implements the SMB protocol, which is the
|
||||
protocol used by Windows for Workgroups, Windows 95 and Windows NT.
|
||||
Smbfs was inspired by Samba, the program written by Andrew Tridgell
|
||||
that turns any Unix host into a file server for DOS or Windows clients.
|
||||
|
||||
Smbfs is an SMB client, but uses parts of samba for its operation. For
|
||||
more info on samba, including documentation, please go to
|
||||
http://www.samba.org/ and then on to your nearest mirror.
|
88
Documentation/filesystems/sysfs-pci.txt
Normal file
@@ -0,0 +1,88 @@
|
||||
Accessing PCI device resources through sysfs
|
||||
|
||||
sysfs, usually mounted at /sys, provides access to PCI resources on platforms
|
||||
that support it. For example, a given bus might look like this:
|
||||
|
||||
/sys/devices/pci0000:17
|
||||
|-- 0000:17:00.0
|
||||
| |-- class
|
||||
| |-- config
|
||||
| |-- detach_state
|
||||
| |-- device
|
||||
| |-- irq
|
||||
| |-- local_cpus
|
||||
| |-- resource
|
||||
| |-- resource0
|
||||
| |-- resource1
|
||||
| |-- resource2
|
||||
| |-- rom
|
||||
| |-- subsystem_device
|
||||
| |-- subsystem_vendor
|
||||
| `-- vendor
|
||||
`-- detach_state
|
||||
|
||||
The topmost element describes the PCI domain and bus number. In this case,
|
||||
the domain number is 0000 and the bus number is 17 (both values are in hex).
|
||||
This bus contains a single function device in slot 0. The domain and bus
|
||||
numbers are reproduced for convenience. Under the device directory are several
|
||||
files, each with their own function.
|
||||
|
||||
       file               function
       ----               --------
       class              PCI class (ascii, ro)
       config             PCI config space (binary, rw)
       detach_state       connection status (bool, rw)
       device             PCI device (ascii, ro)
       irq                IRQ number (ascii, ro)
       local_cpus         nearby CPU mask (cpumask, ro)
       resource           PCI resource host addresses (ascii, ro)
       resource0..N       PCI resource N, if present (binary, mmap)
       rom                PCI ROM resource, if present (binary, ro)
       subsystem_device   PCI subsystem device (ascii, ro)
       subsystem_vendor   PCI subsystem vendor (ascii, ro)
       vendor             PCI vendor (ascii, ro)
|
||||
|
||||
ro - read only file
|
||||
rw - file is readable and writable
|
||||
mmap - file is mmapable
|
||||
ascii - file contains ascii text
|
||||
binary - file contains binary data
|
||||
cpumask - file contains a cpumask type
|
||||
|
||||
The read only files are informational, writes to them will be ignored.
|
||||
Writable files can be used to perform actions on the device (e.g. changing
|
||||
config space, detaching a device). mmapable files are available via an
|
||||
mmap of the file at offset 0 and can be used to do actual device programming
|
||||
from userspace. Note that some platforms don't support mmapping of certain
|
||||
resources, so be sure to check the return value from any attempted mmap.
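A userspace sketch of mapping one of the resourceN files, with the return values checked as advised above. The helper name is made up, and the caller is assumed to know how many bytes to map (for example from the resource file); nothing here is specific to a particular device.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of a sysfs resourceN file at offset 0.
 * Returns NULL if the file cannot be opened or the platform refuses the mmap.
 */
static volatile void *pci_map_resource(const char *path, size_t len)
{
	void *p;
	int fd = open(path, O_RDWR | O_SYNC);

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the descriptor is closed */
	return p == MAP_FAILED ? NULL : p;
}
```

A caller would pass a path such as /sys/devices/pci0000:17/0000:17:00.0/resource0 and access device registers through the returned pointer, releasing the mapping with munmap() when done.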
|
||||
|
||||
Accessing legacy resources through sysfs
|
||||
|
||||
Legacy I/O port and ISA memory resources are also provided in sysfs if the
|
||||
underlying platform supports them. They're located in the PCI class hierarchy,
|
||||
e.g.
|
||||
|
||||
/sys/class/pci_bus/0000:17/
|
||||
|-- bridge -> ../../../devices/pci0000:17
|
||||
|-- cpuaffinity
|
||||
|-- legacy_io
|
||||
`-- legacy_mem
|
||||
|
||||
The legacy_io file is a read/write file that can be used by applications to
|
||||
do legacy port I/O. The application should open the file, seek to the desired
|
||||
port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
|
||||
file should be mmapped with an offset corresponding to the memory offset
|
||||
desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
|
||||
simply dereference the returned pointer (after checking for errors of course)
|
||||
to access legacy memory space.
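The open/seek/read sequence for legacy_io can be sketched as follows. The helper name is invented for this example; it reads a single byte and reports any failure to the caller.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one byte from a legacy I/O port through a legacy_io file.
 * Returns 0 on success and -1 if the open, seek or read fails.
 */
static int legacy_io_read8(const char *path, off_t port, uint8_t *val)
{
	int ret = -1;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (lseek(fd, port, SEEK_SET) == port && read(fd, val, 1) == 1)
		ret = 0;
	close(fd);
	return ret;
}
```

For the example bus above this would be called as legacy_io_read8("/sys/class/pci_bus/0000:17/legacy_io", 0x3e8, &v); 2 and 4 byte accesses follow the same pattern with a larger read.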
|
||||
|
||||
Supporting PCI access on new platforms
|
||||
|
||||
In order to support PCI resource mapping as described above, Linux platform
|
||||
code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
|
||||
Platforms are free to only support subsets of the mmap functionality, but
|
||||
useful return codes should be provided.
|
||||
|
||||
Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
|
||||
wishing to support legacy functionality should define it and provide
|
||||
pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
|
341
Documentation/filesystems/sysfs.txt
Normal file
@@ -0,0 +1,341 @@
|
||||
|
||||
sysfs - _The_ filesystem for exporting kernel objects.
|
||||
|
||||
Patrick Mochel <mochel@osdl.org>
|
||||
|
||||
10 January 2003
|
||||
|
||||
|
||||
What it is:
|
||||
~~~~~~~~~~~
|
||||
|
||||
sysfs is a ram-based filesystem initially based on ramfs. It provides
|
||||
a means to export kernel data structures, their attributes, and the
|
||||
linkages between them to userspace.
|
||||
|
||||
sysfs is tied inherently to the kobject infrastructure. Please read
|
||||
Documentation/kobject.txt for more information concerning the kobject
|
||||
interface.
|
||||
|
||||
|
||||
Using sysfs
|
||||
~~~~~~~~~~~
|
||||
|
||||
sysfs is always compiled in. You can access it by doing:
|
||||
|
||||
mount -t sysfs sysfs /sys
|
||||
|
||||
|
||||
Directory Creation
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
For every kobject that is registered with the system, a directory is
|
||||
created for it in sysfs. That directory is created as a subdirectory
|
||||
of the kobject's parent, expressing internal object hierarchies to
|
||||
userspace. Top-level directories in sysfs represent the common
|
||||
ancestors of object hierarchies; i.e. the subsystems the objects
|
||||
belong to.
|
||||
|
||||
Sysfs internally stores the kobject that owns the directory in the
|
||||
->d_fsdata pointer of the directory's dentry. This allows sysfs to do
|
||||
reference counting directly on the kobject when the file is opened and
|
||||
closed.
|
||||
|
||||
|
||||
Attributes
|
||||
~~~~~~~~~~
|
||||
|
||||
Attributes can be exported for kobjects in the form of regular files in
|
||||
the filesystem. Sysfs forwards file I/O operations to methods defined
|
||||
for the attributes, providing a means to read and write kernel
|
||||
attributes.
|
||||
|
||||
Attributes should be ASCII text files, preferably with only one value
|
||||
per file. It is noted that it may not be efficient to contain only
|
||||
one value per file, so it is socially acceptable to express an array of
|
||||
values of the same type.
|
||||
|
||||
Mixing types, expressing multiple lines of data, and doing fancy
|
||||
formatting of data is heavily frowned upon. Doing these things may get
|
||||
you publicly humiliated and your code rewritten without notice.
|
||||
|
||||
|
||||
An attribute definition is simply:
|
||||
|
||||
struct attribute {
|
||||
char * name;
|
||||
mode_t mode;
|
||||
};
|
||||
|
||||
|
||||
int sysfs_create_file(struct kobject * kobj, struct attribute * attr);
|
||||
void sysfs_remove_file(struct kobject * kobj, struct attribute * attr);
|
||||
|
||||
|
||||
A bare attribute contains no means to read or write the value of the
|
||||
attribute. Subsystems are encouraged to define their own attribute
|
||||
structure and wrapper functions for adding and removing attributes for
|
||||
a specific object type.
|
||||
|
||||
For example, the driver model defines struct device_attribute like:
|
||||
|
||||
struct device_attribute {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct device * dev, char * buf);
|
||||
ssize_t (*store)(struct device * dev, const char * buf);
|
||||
};
|
||||
|
||||
int device_create_file(struct device *, struct device_attribute *);
|
||||
void device_remove_file(struct device *, struct device_attribute *);
|
||||
|
||||
It also defines this helper for defining device attributes:
|
||||
|
||||
#define DEVICE_ATTR(_name,_mode,_show,_store) \
|
||||
struct device_attribute dev_attr_##_name = { \
|
||||
.attr = {.name = __stringify(_name) , .mode = _mode }, \
|
||||
.show = _show, \
|
||||
.store = _store, \
|
||||
};
|
||||
|
||||
For example, declaring
|
||||
|
||||
static DEVICE_ATTR(foo,0644,show_foo,store_foo);
|
||||
|
||||
is equivalent to doing:
|
||||
|
||||
static struct device_attribute dev_attr_foo = {
|
||||
.attr = {
|
||||
.name = "foo",
|
||||
.mode = 0644,
|
||||
},
|
||||
.show = show_foo,
|
||||
.store = store_foo,
|
||||
};
|
||||
|
||||
|
||||
Subsystem-Specific Callbacks
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When a subsystem defines a new attribute type, it must implement a
|
||||
set of sysfs operations for forwarding read and write calls to the
|
||||
show and store methods of the attribute owners.
|
||||
|
||||
struct sysfs_ops {
|
||||
ssize_t (*show)(struct kobject *, struct attribute *,char *);
|
||||
ssize_t (*store)(struct kobject *,struct attribute *,const char *);
|
||||
};
|
||||
|
||||
[ Subsystems should have already defined a struct kobj_type as a
|
||||
descriptor for this type, which is where the sysfs_ops pointer is
|
||||
stored. See the kobject documentation for more information. ]
|
||||
|
||||
When a file is read or written, sysfs calls the appropriate method
|
||||
for the type. The method then translates the generic struct kobject
|
||||
and struct attribute pointers to the appropriate pointer types, and
|
||||
calls the associated methods.
|
||||
|
||||
|
||||
To illustrate:
|
||||
|
||||
#define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr)
|
||||
#define to_dev(d) container_of(d, struct device, kobj)
|
||||
|
||||
static ssize_t
|
||||
dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
|
||||
{
|
||||
struct device_attribute * dev_attr = to_dev_attr(attr);
|
||||
struct device * dev = to_dev(kobj);
|
||||
ssize_t ret = 0;
|
||||
|
||||
if (dev_attr->show)
|
||||
ret = dev_attr->show(dev,buf);
|
||||
return ret;
|
||||
}
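
The pointer translation done by to_dev_attr() and to_dev() can be tried out
in userspace. The following is only a sketch: container_of is re-created here
with offsetof() rather than taken from the kernel headers, and struct
demo_attribute and to_demo_attr() are hypothetical names used for
illustration.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct attribute {
	const char *name;
	unsigned int mode;
};

/* Hypothetical subsystem attribute embedding the generic one. */
struct demo_attribute {
	int value;
	struct attribute attr;
};

/* Recover the enclosing demo_attribute from its embedded attr. */
struct demo_attribute *to_demo_attr(struct attribute *attr)
{
	return container_of(attr, struct demo_attribute, attr);
}
```

Because the generic structure is embedded rather than pointed to, the
conversion is plain pointer arithmetic and needs no lookup table.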
|
||||
|
||||
|
||||
|
||||
Reading/Writing Attribute Data
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To read or write attributes, show() or store() methods must be
|
||||
specified when declaring the attribute. The method types should be as
|
||||
simple as those defined for device attributes:
|
||||
|
||||
ssize_t (*show)(struct device * dev, char * buf);
|
||||
ssize_t (*store)(struct device * dev, const char * buf);
|
||||
|
||||
IOW, they should take only an object and a buffer as parameters.
|
||||
|
||||
|
||||
sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
|
||||
method. Sysfs will call the method exactly once for each read or
|
||||
write. This forces the following behavior on the method
|
||||
implementations:
|
||||
|
||||
- On read(2), the show() method should fill the entire buffer.
|
||||
Recall that an attribute should only be exporting one value, or an
|
||||
array of similar values, so this shouldn't be that expensive.
|
||||
|
||||
This allows userspace to do partial reads and seeks arbitrarily over
|
||||
the entire file at will.
|
||||
|
||||
- On write(2), sysfs expects the entire buffer to be passed during the
|
||||
first write. Sysfs then passes the entire buffer to the store()
|
||||
method.
|
||||
|
||||
When writing sysfs files, userspace processes should first read the
|
||||
entire file, modify the values it wishes to change, then write the
|
||||
entire buffer back.
|
||||
|
||||
Attribute method implementations should operate on an identical
|
||||
buffer when reading and writing values.
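
From the userspace side, the rules above suggest reading an attribute with
one read(2) into a buffer large enough for the whole value. A minimal sketch,
assuming the caller supplies the path and a buffer of up to PAGE_SIZE bytes;
read_attr is a hypothetical helper name:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read an attribute file with a single read(2) and NUL-terminate it.
 * read_attr is a hypothetical helper, not part of any library.
 * Returns the number of bytes read, or -1 on failure.
 */
ssize_t read_attr(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	if (len == 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, len - 1);	/* leave room for the NUL */
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}
```

Writing works the same way in reverse: build the complete new value in a
buffer and hand it to a single write(2).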
|
||||
|
||||
Other notes:
|
||||
|
||||
- The buffer will always be PAGE_SIZE bytes in length. On i386, this
|
||||
is 4096.
|
||||
|
||||
- show() methods should return the number of bytes printed into the
|
||||
buffer. This is the return value of snprintf().
|
||||
|
||||
- show() should always use snprintf().
|
||||
|
||||
- store() should return the number of bytes used from the buffer. This
|
||||
can be done using strlen().
|
||||
|
||||
- show() or store() can always return errors. If a bad value comes
|
||||
through, be sure to return an error.
|
||||
|
||||
- The object passed to the methods will be pinned in memory via sysfs
|
||||
reference counting its embedded object. However, the physical
|
||||
entity (e.g. device) the object represents may not be present. Be
|
||||
sure to have a way to check this, if necessary.
|
||||
|
||||
|
||||
A very simple (and naive) implementation of a device attribute is:
|
||||
|
||||
static ssize_t show_name(struct device * dev, char * buf)
|
||||
{
|
||||
return sprintf(buf,"%s\n",dev->name);
|
||||
}
|
||||
|
||||
static ssize_t store_name(struct device * dev, const char * buf)
|
||||
{
|
||||
sscanf(buf,"%20s",dev->name);
|
||||
return strlen(buf);
|
||||
}
|
||||
|
||||
static DEVICE_ATTR(name,S_IRUGO,show_name,store_name);
|
||||
|
||||
|
||||
(Note that the real implementation doesn't allow userspace to set the
|
||||
name for a device.)
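
A slightly less naive variant, following the notes above (snprintf() in
show(), an error return for bad input in store()), can be modelled in
userspace. This is a sketch only: struct demo_device, DEMO_PAGE_SIZE and the
demo_* functions are hypothetical stand-ins, and -1 stands in for a kernel
error code such as -EINVAL.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define DEMO_PAGE_SIZE 4096	/* stands in for PAGE_SIZE */

/* Hypothetical object; the real struct device is much larger. */
struct demo_device {
	char name[21];		/* 20 characters plus the terminating NUL */
};

/* Format one value into the page-sized buffer, return bytes printed. */
ssize_t demo_show_name(struct demo_device *dev, char *buf)
{
	return snprintf(buf, DEMO_PAGE_SIZE, "%s\n", dev->name);
}

/* Parse one value; reject input that holds no name at all. */
ssize_t demo_store_name(struct demo_device *dev, const char *buf)
{
	char tmp[21];

	if (sscanf(buf, "%20s", tmp) != 1)
		return -1;	/* the kernel would return an error code */
	strcpy(dev->name, tmp);	/* tmp is at most 20 characters + NUL */
	return strlen(buf);
}
```

Parsing into a temporary first means a rejected write leaves the old name
untouched.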
|
||||
|
||||
|
||||
Top Level Directory Layout
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The sysfs directory arrangement exposes the relationship of kernel
|
||||
data structures.
|
||||
|
||||
The top level sysfs directory looks like:
|
||||
|
||||
block/
|
||||
bus/
|
||||
class/
|
||||
devices/
|
||||
firmware/
|
||||
net/
|
||||
|
||||
devices/ contains a filesystem representation of the device tree. It maps
|
||||
directly to the internal kernel device tree, which is a hierarchy of
|
||||
struct device.
|
||||
|
||||
bus/ contains a flat directory layout of the various bus types in the
|
||||
kernel. Each bus's directory contains two subdirectories:
|
||||
|
||||
devices/
|
||||
drivers/
|
||||
|
||||
devices/ contains symlinks for each device discovered in the system
|
||||
that point to the device's directory under root/.
|
||||
|
||||
drivers/ contains a directory for each device driver that is loaded
|
||||
for devices on that particular bus (this assumes that drivers do not
|
||||
span multiple bus types).
|
||||
|
||||
|
||||
More information on driver-model specific features can be found in
|
||||
Documentation/driver-model/.
|
||||
|
||||
|
||||
TODO: Finish this section.
|
||||
|
||||
|
||||
Current Interfaces
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The following interface layers currently exist in sysfs:
|
||||
|
||||
|
||||
- devices (include/linux/device.h)
|
||||
----------------------------------
|
||||
Structure:
|
||||
|
||||
struct device_attribute {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct device * dev, char * buf);
|
||||
ssize_t (*store)(struct device * dev, const char * buf);
|
||||
};
|
||||
|
||||
Declaring:
|
||||
|
||||
DEVICE_ATTR(_name,_mode,_show,_store);
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
int device_create_file(struct device *device, struct device_attribute * attr);
|
||||
void device_remove_file(struct device * dev, struct device_attribute * attr);
|
||||
|
||||
|
||||
- bus drivers (include/linux/device.h)
|
||||
--------------------------------------
|
||||
Structure:
|
||||
|
||||
struct bus_attribute {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct bus_type *, char * buf);
|
||||
ssize_t (*store)(struct bus_type *, const char * buf);
|
||||
};
|
||||
|
||||
Declaring:
|
||||
|
||||
BUS_ATTR(_name,_mode,_show,_store)
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
int bus_create_file(struct bus_type *, struct bus_attribute *);
|
||||
void bus_remove_file(struct bus_type *, struct bus_attribute *);
|
||||
|
||||
|
||||
- device drivers (include/linux/device.h)
|
||||
-----------------------------------------
|
||||
|
||||
Structure:
|
||||
|
||||
struct driver_attribute {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct device_driver *, char * buf);
|
||||
ssize_t (*store)(struct device_driver *, const char * buf);
|
||||
};
|
||||
|
||||
Declaring:
|
||||
|
||||
DRIVER_ATTR(_name,_mode,_show,_store)
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
int driver_create_file(struct device_driver *, struct driver_attribute *);
|
||||
void driver_remove_file(struct device_driver *, struct driver_attribute *);
|
||||
|
||||
|
38
Documentation/filesystems/sysv-fs.txt
Normal file
@@ -0,0 +1,38 @@
|
||||
This is the implementation of the SystemV/Coherent filesystem for Linux.
|
||||
It implements all of
|
||||
- Xenix FS,
|
||||
- SystemV/386 FS,
|
||||
- Coherent FS.
|
||||
|
||||
This is version beta 4.
|
||||
|
||||
To install:
|
||||
* Answer the 'System V and Coherent filesystem support' question with 'y'
|
||||
when configuring the kernel.
|
||||
* To mount a disk or a partition, use
|
||||
mount [-r] -t sysv device mountpoint
|
||||
The file system type names
|
||||
-t sysv
|
||||
-t xenix
|
||||
-t coherent
|
||||
may be used interchangeably, but the last two will eventually disappear.
|
||||
|
||||
Bugs in the present implementation:
|
||||
- Coherent FS:
|
||||
- The "free list interleave" n:m is currently ignored.
|
||||
- Only file systems with no filesystem name and no pack name are recognized.
|
||||
(See Coherent "man mkfs" for a description of these features.)
|
||||
- SystemV Release 2 FS:
|
||||
The superblock is only searched in the blocks 9, 15, 18, which
|
||||
correspond to the beginning of track 1 on floppy disks. No support
|
||||
for this FS on hard disk yet.
|
||||
|
||||
|
||||
Please report any bugs and suggestions to
|
||||
Bruno Haible <haible@ma2s2.mathematik.uni-karlsruhe.de>
|
||||
Pascal Haible <haible@izfm.uni-stuttgart.de>
|
||||
Krzysztof G. Baranowski <kgb@manjak.knm.org.pl>
|
||||
|
||||
Bruno Haible
|
||||
<haible@ma2s2.mathematik.uni-karlsruhe.de>
|
||||
|
100
Documentation/filesystems/tmpfs.txt
Normal file
@@ -0,0 +1,100 @@
|
||||
Tmpfs is a file system which keeps all files in virtual memory.
|
||||
|
||||
|
||||
Everything in tmpfs is temporary in the sense that no files will be
|
||||
created on your hard drive. If you unmount a tmpfs instance,
|
||||
everything stored therein is lost.
|
||||
|
||||
tmpfs puts everything into the kernel internal caches and grows and
|
||||
shrinks to accommodate the files it contains and is able to swap
|
||||
unneeded pages out to swap space. It has maximum size limits which can
|
||||
be adjusted on the fly via 'mount -o remount ...'
|
||||
|
||||
If you compare it to ramfs (which was the template to create tmpfs)
|
||||
you gain swapping and limit checking. Another similar thing is the RAM
|
||||
disk (/dev/ram*), which simulates a fixed size hard disk in physical
|
||||
RAM, where you have to create an ordinary filesystem on top. Ramdisks
|
||||
cannot swap and you do not have the possibility to resize them.
|
||||
|
||||
Since tmpfs lives completely in the page cache and on swap, all tmpfs
|
||||
pages currently in memory will show up as cached. It will not show up
|
||||
as shared or something like that. Furthermore, you can check the actual
|
||||
RAM+swap use of a tmpfs instance with df(1) and du(1).
|
||||
|
||||
|
||||
tmpfs has the following uses:
|
||||
|
||||
1) There is always a kernel internal mount which you will not see at
|
||||
all. This is used for shared anonymous mappings and SYSV shared
|
||||
memory.
|
||||
|
||||
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
|
||||
set, the user visible part of tmpfs is not built. But the internal
|
||||
mechanisms are always present.
|
||||
|
||||
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
|
||||
POSIX shared memory (shm_open, shm_unlink). Adding the following
|
||||
line to /etc/fstab should take care of this:
|
||||
|
||||
tmpfs /dev/shm tmpfs defaults 0 0
|
||||
|
||||
Remember to create the directory that you intend to mount tmpfs on
|
||||
if necessary (/dev/shm is automagically created if you use devfs).
|
||||
|
||||
This mount is _not_ needed for SYSV shared memory. The internal
|
||||
mount is used for that. (In the 2.3 kernel versions it was
|
||||
necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
|
||||
shared memory)
|
||||
|
||||
3) Some people (including me) find it very convenient to mount it
|
||||
e.g. on /tmp and /var/tmp and have a big swap partition. And now
|
||||
loop mounts of tmpfs files do work, so mkinitrd shipped by most
|
||||
distributions should succeed with a tmpfs /tmp.
|
||||
|
||||
4) And probably a lot more I do not know about :-)
|
||||
|
||||
|
||||
tmpfs has three mount options for sizing:
|
||||
|
||||
size: The limit of allocated bytes for this tmpfs instance. The
|
||||
default is half of your physical RAM without swap. If you
|
||||
oversize your tmpfs instances the machine will deadlock
|
||||
since the OOM handler will not be able to free that memory.
|
||||
nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
|
||||
nr_inodes: The maximum number of inodes for this instance. The default
|
||||
is half of the number of your physical RAM pages, or (on a
|
||||
machine with highmem) the number of lowmem RAM pages,
|
||||
whichever is the lower.
|
||||
|
||||
These parameters accept a suffix k, m or g for kilo, mega and giga and
|
||||
can be changed on remount. The size parameter also accepts a suffix %
|
||||
to limit this tmpfs instance to that percentage of your physical RAM:
|
||||
the default, when neither size nor nr_blocks is specified, is size=50%
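
The k, m and g suffixes multiply by powers of 1024, which is why
nr_inodes=10k means 10240 inodes. A minimal userspace sketch of such a
parser follows; parse_size is a hypothetical helper, not the kernel's own
parsing routine, and it deliberately rejects the % suffix, since that needs
the amount of physical RAM.

```c
#include <stdlib.h>

/*
 * Parse a decimal number with an optional k, m or g suffix (powers of
 * 1024). parse_size is a hypothetical helper name.
 * Returns 0 on success and stores the result in *out, -1 on bad input.
 */
int parse_size(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 10);

	if (end == s)
		return -1;		/* no digits */
	switch (*end) {
	case 'g': case 'G': v <<= 10;	/* fall through */
	case 'm': case 'M': v <<= 10;	/* fall through */
	case 'k': case 'K': v <<= 10; end++; break;
	default: break;
	}
	if (*end != '\0')
		return -1;		/* trailing garbage, including '%' */
	*out = v;
	return 0;
}
```

With this sketch, "10k" yields 10240 and "10G" yields 10737418240.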
|
||||
|
||||
If both nr_blocks (or size) and nr_inodes are set to 0, neither blocks
|
||||
nor inodes will be limited in that instance. It is generally unwise to
|
||||
mount with such options, since it allows any user with write access to
|
||||
use up all the memory on the machine; but enhances the scalability of
|
||||
that instance in a system with many cpus making intensive use of it.
|
||||
|
||||
|
||||
To specify the initial root directory you can use the following mount
|
||||
options:
|
||||
|
||||
mode: The permissions as an octal number
|
||||
uid: The user id
|
||||
gid: The group id
|
||||
|
||||
These options do not have any effect on remount. You can change these
|
||||
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
|
||||
|
||||
|
||||
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
|
||||
will give you a tmpfs instance on /mytmpfs which can allocate 10GB
|
||||
RAM/SWAP in 10240 inodes and it is only accessible by root.
|
||||
|
||||
|
||||
Author:
|
||||
Christoph Rohland <cr@sap.com>, 1.12.01
|
||||
Updated:
|
||||
Hugh Dickins <hugh@veritas.com>, 01 September 2004
|
57
Documentation/filesystems/udf.txt
Normal file
@@ -0,0 +1,57 @@
|
||||
*
|
||||
* Documentation/filesystems/udf.txt
|
||||
*
|
||||
UDF Filesystem version 0.9.8.1
|
||||
|
||||
If you encounter problems with reading UDF discs using this driver,
|
||||
please report them to linux_udf@hpesjro.fc.hp.com, which is the
|
||||
developer's list.
|
||||
|
||||
Write support requires a block driver which supports writing. The current
|
||||
scsi and ide cdrom drivers do not support writing.
|
||||
|
||||
-------------------------------------------------------------------------------
|
||||
The following mount options are supported:
|
||||
|
||||
gid= Set the default group.
|
||||
umask= Set the default umask.
|
||||
uid= Set the default user.
|
||||
bs= Set the block size.
|
||||
unhide Show otherwise hidden files.
|
||||
undelete Show deleted files in lists.
|
||||
adinicb Embed data in the inode (default)
|
||||
noadinicb Don't embed data in the inode
|
||||
shortad Use short ad's
|
||||
longad Use long ad's (default)
|
||||
nostrict Unset strict conformance
|
||||
iocharset= Set the NLS character set
|
||||
|
||||
The remaining are for debugging and disaster recovery:
|
||||
|
||||
novrs Skip volume sequence recognition
|
||||
|
||||
The following expect an offset from 0.
|
||||
|
||||
session= Set the CDROM session (default= last session)
|
||||
anchor= Override standard anchor location. (default= 256)
|
||||
volume= Override the VolumeDesc location. (unused)
|
||||
partition= Override the PartitionDesc location. (unused)
|
||||
lastblock= Set the last block of the filesystem.
|
||||
|
||||
The following expect an offset from the partition root.
|
||||
|
||||
fileset= Override the fileset block location. (unused)
|
||||
rootdir= Override the root directory location. (unused)
|
||||
WARNING: overriding the rootdir to a non-directory may
|
||||
yield highly unpredictable results.
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
|
||||
For the latest version and toolset see:
|
||||
http://linux-udf.sourceforge.net/
|
||||
|
||||
Documentation on UDF and ECMA 167 is available FREE from:
|
||||
http://www.osta.org/
|
||||
http://www.ecma-international.org/
|
||||
|
||||
Ben Fennema <bfennema@falcon.csc.calpoly.edu>
|
61
Documentation/filesystems/ufs.txt
Normal file
@@ -0,0 +1,61 @@
|
||||
USING UFS
|
||||
=========
|
||||
|
||||
mount -t ufs -o ufstype=type_of_ufs device dir
|
||||
|
||||
|
||||
UFS OPTIONS
|
||||
===========
|
||||
|
||||
ufstype=type_of_ufs
|
||||
UFS is a file system widely used in different operating systems.
|
||||
The problem is the differences among implementations. Features of
|
||||
some implementations are undocumented, so it's hard to recognize the
|
||||
type of ufs automatically. That's why the user must specify the type of
|
||||
ufs manually with the mount option ufstype. Possible values are:
|
||||
|
||||
old old format of ufs
|
||||
default value, supported as read-only
|
||||
|
||||
44bsd used in FreeBSD, NetBSD, OpenBSD
|
||||
supported as read-write
|
||||
|
||||
ufs2 used in FreeBSD 5.x
|
||||
supported as read-only
|
||||
|
||||
5xbsd synonym for ufs2
|
||||
|
||||
sun used in SunOS (Solaris)
|
||||
supported as read-write
|
||||
|
||||
sunx86 used in SunOS for Intel (Solarisx86)
|
||||
supported as read-write
|
||||
|
||||
hp used in HP-UX
|
||||
supported as read-only
|
||||
|
||||
nextstep
|
||||
used in NextStep
|
||||
supported as read-only
|
||||
|
||||
nextstep-cd
|
||||
used for NextStep CDROMs (block_size == 2048)
|
||||
supported as read-only
|
||||
|
||||
openstep
|
||||
used in OpenStep
|
||||
supported as read-only
|
||||
|
||||
|
||||
POSSIBLE PROBLEMS
|
||||
=================
|
||||
|
||||
There is still a bug in the reallocation of fragments, in file fs/ufs/balloc.c,
|
||||
line 364. However, it seems to work with the current buffer cache configuration.
|
||||
|
||||
|
||||
BUG REPORTS
|
||||
===========
|
||||
|
||||
Send any ufs bug reports to daniel.pirkl@email.cz (do not send
|
||||
partition table bug reports).
|
231
Documentation/filesystems/vfat.txt
Normal file
@@ -0,0 +1,231 @@
|
||||
USING VFAT
|
||||
----------------------------------------------------------------------
|
||||
To use the vfat filesystem, use the filesystem type 'vfat'. i.e.
|
||||
mount -t vfat /dev/fd0 /mnt
|
||||
|
||||
No special partition formatter is required. mkdosfs will work fine
|
||||
if you want to format from within Linux.
|
||||
|
||||
VFAT MOUNT OPTIONS
|
||||
----------------------------------------------------------------------
|
||||
umask=### -- The permission mask (for files and directories, see umask(1)).
|
||||
The default is the umask of current process.
|
||||
|
||||
dmask=### -- The permission mask for the directory.
|
||||
The default is the umask of current process.
|
||||
|
||||
fmask=### -- The permission mask for files.
|
||||
The default is the umask of current process.
|
||||
|
||||
codepage=### -- Sets the codepage number for converting to shortname
|
||||
characters on FAT filesystem.
|
||||
By default, FAT_DEFAULT_CODEPAGE setting is used.
|
||||
|
||||
iocharset=name -- Character set to use for converting between the
|
||||
encoding used for user visible filenames and 16 bit
|
||||
Unicode characters. Long filenames are stored on disk
|
||||
in Unicode format, but Unix for the most part doesn't
|
||||
know how to deal with Unicode.
|
||||
By default, FAT_DEFAULT_IOCHARSET setting is used.
|
||||
|
||||
There is also an option of doing UTF8 translations
|
||||
with the utf8 option.
|
||||
|
||||
NOTE: "iocharset=utf8" is not recommended. If unsure,
|
||||
you should consider the following option instead.
|
||||
|
||||
utf8=<bool> -- UTF8 is the filesystem safe version of Unicode that
|
||||
is used by the console. It can be enabled for the
|
||||
filesystem with this option. If 'uni_xlate' gets set,
|
||||
UTF8 gets disabled.
|
||||
|
||||
uni_xlate=<bool> -- Translate unhandled Unicode characters to special
|
||||
escaped sequences. This would let you backup and
|
||||
restore filenames that are created with any Unicode
|
||||
characters. Until Linux supports Unicode for real,
|
||||
this gives you an alternative. Without this option,
|
||||
a '?' is used when no translation is possible. The
|
||||
escape character is ':' because it is otherwise
|
||||
illegal on the vfat filesystem. The escape sequence
|
||||
that gets used is ':' and the four digits of hexadecimal
|
||||
unicode.
|
||||
|
||||
nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
|
||||
end in '~1' or tilde followed by some number. If this
|
||||
option is set, then if the filename is
|
||||
"longfilename.txt" and "longfile.txt" does not
|
||||
currently exist in the directory, 'longfile.txt' will
|
||||
be the short alias instead of 'longfi~1.txt'.
|
||||
|
||||
quiet -- Stops printing certain warning messages.
|
||||
|
||||
check=s|r|n -- Case sensitivity checking setting.
|
||||
s: strict, case sensitive
|
||||
r: relaxed, case insensitive
|
||||
n: normal, default setting, currently case insensitive
|
||||
|
||||
shortname=lower|win95|winnt|mixed
|
||||
-- Shortname display/create setting.
|
||||
lower: convert to lowercase for display,
|
||||
emulate the Windows 95 rule for create.
|
||||
win95: emulate the Windows 95 rule for display/create.
|
||||
winnt: emulate the Windows NT rule for display/create.
|
||||
mixed: emulate the Windows NT rule for display,
|
||||
emulate the Windows 95 rule for create.
|
||||
Default setting is `lower'.
|
||||
|
||||
<bool>: 0,1,yes,no,true,false
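
The escape sequence described under uni_xlate (a ':' followed by four
hexadecimal digits) can be sketched as below. The function name uni_escape
is hypothetical, and the use of lowercase hex digits is an assumption, since
the text above does not say which case is used.

```c
#include <stdio.h>

/*
 * Write ':' followed by four hex digits for one 16-bit Unicode value.
 * uni_escape is a hypothetical helper; lowercase digits are an
 * assumption. out must have room for 6 bytes (5 characters + NUL).
 */
void uni_escape(unsigned int uc, char out[6])
{
	snprintf(out, 6, ":%04x", uc & 0xffffu);
}
```

Since ':' cannot otherwise appear in a vfat filename, a reader can tell an
escape sequence from ordinary name characters.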
|
||||
|
||||
TODO
|
||||
----------------------------------------------------------------------
|
||||
* Need to get rid of the raw scanning stuff. Instead, always use
|
||||
a get next directory entry approach. The only thing left that uses
|
||||
raw scanning is the directory renaming code.
|
||||
|
||||
|
||||
POSSIBLE PROBLEMS
|
||||
----------------------------------------------------------------------
|
||||
* vfat_valid_longname does not properly check reserved names.
|
||||
* When a volume name is the same as a directory name in the root
|
||||
directory of the filesystem, the directory name sometimes shows
|
||||
up as an empty file.
|
||||
* autoconv option does not work correctly.
|
||||
|
||||
BUG REPORTS
|
||||
----------------------------------------------------------------------
|
||||
If you have trouble with the VFAT filesystem, mail bug reports to
|
||||
chaffee@bmrc.cs.berkeley.edu. Please specify the filename
|
||||
and the operation that gave you trouble.
|
||||
|
||||
TEST SUITE
|
||||
----------------------------------------------------------------------
|
||||
If you plan to make any modifications to the vfat filesystem, please
|
||||
get the test suite that comes with the vfat distribution at
|
||||
|
||||
http://bmrc.berkeley.edu/people/chaffee/vfat.html
|
||||
|
||||
This tests quite a few parts of the vfat filesystem and additional
|
||||
tests for new features or untested features would be appreciated.
|
||||
|
||||
NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
|
||||
----------------------------------------------------------------------
|
||||
(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
|
||||
and lightly annotated by Gordon Chaffee).
|
||||
|
||||
This document presents a very rough, technical overview of my
|
||||
knowledge of the extended FAT file system used in Windows NT 3.5 and
|
||||
Windows 95. I don't guarantee that any of the following is correct,
|
||||
but it appears to be so.
|
||||
|
||||
The extended FAT file system is almost identical to the FAT
|
||||
file system used in DOS versions up to and including 6.223410239847
|
||||
:-). The significant change has been the addition of long file names.
|
||||
These names support up to 255 characters including spaces and lower
|
||||
case characters as opposed to the traditional 8.3 short names.
|
||||
|
||||
Here is the description of the traditional FAT entry in the current
|
||||
Windows 95 filesystem:
|
||||
|
||||
struct directory { // Short 8.3 names
|
||||
unsigned char name[8]; // file name
|
||||
unsigned char ext[3]; // file extension
|
||||
unsigned char attr; // attribute byte
|
||||
unsigned char lcase; // Case for base and extension
|
||||
unsigned char ctime_ms; // Creation time, milliseconds
|
||||
unsigned char ctime[2]; // Creation time
|
||||
unsigned char cdate[2]; // Creation date
|
||||
unsigned char adate[2]; // Last access date
|
||||
unsigned char reserved[2]; // reserved values (ignored)
|
||||
unsigned char time[2]; // time stamp
|
||||
unsigned char date[2]; // date stamp
|
||||
unsigned char start[2]; // starting cluster number
|
||||
unsigned char size[4]; // size of the file
|
||||
};
|
||||
|
||||
The lcase field specifies if the base and/or the extension of an 8.3
|
||||
name should be capitalized. This field does not seem to be used by
|
||||
Windows 95 but it is used by Windows NT. The case of filenames is not
|
||||
completely compatible from Windows NT to Windows 95. It is completely
|
||||
compatible in the reverse direction, however. Filenames that fit in
|
||||
the 8.3 namespace and are written on Windows NT to be lowercase will
|
||||
show up as uppercase on Windows 95.
|
||||
|
||||
Note that the "start" and "size" values are actually little
|
||||
endian integer values. The descriptions of the fields in this
|
||||
structure are public knowledge and can be found elsewhere.
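
Because the fields are declared as byte arrays, a reader has to assemble the
integers itself. A minimal sketch of that decoding follows; the helper names
fat_le16 and fat_le32 are hypothetical.

```c
#include <stdint.h>

/* Decode a 2-byte little endian field such as "start". */
uint16_t fat_le16(const unsigned char b[2])
{
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Decode a 4-byte little endian field such as "size". */
uint32_t fat_le32(const unsigned char b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

Assembling the value byte by byte gives the same result on big endian and
little endian hosts.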
|
||||
|
||||
With the extended FAT system, Microsoft has inserted extra
|
||||
directory entries for any files with extended names. (Any name which
|
||||
legally fits within the old 8.3 encoding scheme does not have extra
|
||||
entries.) I call these extra entries slots. Basically, a slot is a
|
||||
specially formatted directory entry which holds up to 13 characters of
|
||||
a file's extended name. Think of slots as additional labeling for the
|
||||
directory entry of the file to which they correspond. Microsoft
|
||||
prefers to refer to the 8.3 entry for a file as its alias and the
|
||||
extended slot directory entries as the file name.
|
||||
|
||||
The C structure for a slot directory entry follows:
|
||||
|
||||
struct slot { // Up to 13 characters of a long name
|
||||
unsigned char id; // sequence number for slot
|
||||
unsigned char name0_4[10]; // first 5 characters in name
|
||||
unsigned char attr; // attribute byte
|
||||
unsigned char reserved; // always 0
|
||||
unsigned char alias_checksum; // checksum for 8.3 alias
|
||||
unsigned char name5_10[12]; // 6 more characters in name
|
||||
unsigned char start[2]; // starting cluster number
|
||||
unsigned char name11_12[4]; // last 2 characters in name
|
||||
};
|
||||
|
||||
If the layout of the slots looks a little odd, it's only
|
||||
because of Microsoft's efforts to maintain compatibility with old
|
||||
software. The slots must be disguised to prevent old software from
|
||||
panicking. To this end, a number of measures are taken:
|
||||
|
||||
1) The attribute byte for a slot directory entry is always set
|
||||
to 0x0f. This corresponds to an old directory entry with
|
||||
attributes of "hidden", "system", "read-only", and "volume
|
||||
label". Most old software will ignore any directory
|
||||
entries with the "volume label" bit set. Real volume label
|
||||
entries don't have the other three bits set.
|
||||
|
||||
2) The starting cluster is always set to 0, an impossible
|
||||
value for a DOS file.
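
The two measures above give a simple test for telling a slot from an
ordinary entry. A sketch, using the usual FAT attribute bit values; the
macro and function names are chosen here for illustration only.

```c
/* FAT attribute bits; the names are illustrative. */
#define ATTR_RO      0x01	/* read-only */
#define ATTR_HIDDEN  0x02
#define ATTR_SYS     0x04	/* system */
#define ATTR_VOLUME  0x08	/* volume label */
#define ATTR_LONG_NAME (ATTR_RO | ATTR_HIDDEN | ATTR_SYS | ATTR_VOLUME) /* 0x0f */

/* Nonzero if attr and start look like a long name slot. */
int is_long_name_slot(unsigned char attr, const unsigned char start[2])
{
	return attr == ATTR_LONG_NAME && start[0] == 0 && start[1] == 0;
}
```

A real volume label entry has only the 0x08 bit set, so it fails the
attribute comparison.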
|
||||
|
||||
Because the extended FAT system is backward compatible, it is
|
||||
possible for old software to modify directory entries. Measures must
|
||||
be taken to ensure the validity of slots. An extended FAT system can
|
||||
verify that a slot does in fact belong to an 8.3 directory entry by
|
||||
the following:
|
||||
|
||||
1) Positioning. Slots for a file always immediately precede
|
||||
their corresponding 8.3 directory entry. In addition, each
|
||||
slot has an id which marks its order in the extended file
|
||||
name. Here is a very abbreviated view of an 8.3 directory
|
||||
entry and its corresponding long name slots for the file
"My Big File.Extension which is long":

	<preceding files...>
	<slot #3, id = 0x43, characters = "h is long">
	<slot #2, id = 0x02, characters = "xtension whic">
	<slot #1, id = 0x01, characters = "My Big File.E">
	<directory entry, name = "MYBIGFIL.EXT">
|
||||
|
||||
Note that the slots are stored from last to first. Slots
|
||||
are numbered from 1 to N. The Nth slot is or'ed with 0x40
|
||||
to mark it as the last one.
|
||||
|
||||
2) Checksum. Each slot has an "alias_checksum" value. The
|
||||
checksum is calculated from the 8.3 name using the
|
||||
following algorithm:
|
||||
|
||||
	for (sum = i = 0; i < 11; i++) {
		sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
	}
|
||||
|
||||
3) If there is free space in the final slot, a Unicode NULL (0x0000)
|
||||
is stored after the final character. After that, all unused
|
||||
characters in the final slot are set to Unicode 0xFFFF.
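
As a rough illustration of the positioning and checksum rules above, here
is a small userspace sketch in C. The names vfat_alias_checksum(),
slot_seq(), slot_is_last() and SLOT_LAST_FLAG are made up for this example
and are not taken from the kernel source. The sum is kept in an unsigned
char so that it wraps at 8 bits, which is an assumption based on the
checksum being stored in a single byte of the slot.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_LAST_FLAG 0x40	/* or'ed into the id of the Nth (last) slot */

/* Checksum of the 11-byte 8.3 name (8 name bytes followed by 3 extension bytes). */
unsigned char vfat_alias_checksum(const unsigned char *name)
{
	unsigned char sum = 0;
	size_t i;

	assert(name != NULL);
	for (i = 0; i < 11; i++) {
		/* rotate the sum right by one bit, then add the next name byte */
		sum = (unsigned char)((((sum & 1) << 7) | ((sum & 0xfe) >> 1)) + name[i]);
	}
	return sum;
}

/* Sequence number (1..N) of a slot, with the "last slot" marker removed. */
unsigned char slot_seq(unsigned char id)
{
	return (unsigned char)(id & ~SLOT_LAST_FLAG);
}

/* Non-zero if this slot is marked as the last one of the long name. */
int slot_is_last(unsigned char id)
{
	return (id & SLOT_LAST_FLAG) != 0;
}
```

For the example entry above, slot #3 carries id 0x43, so slot_seq() gives 3
and slot_is_last() is true. A reader of the directory would compare
vfat_alias_checksum() of the 8.3 entry against the alias_checksum stored in
each slot and discard the slots on a mismatch.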
|
||||
|
||||
Finally, note that the extended name is stored in Unicode. Each Unicode
|
||||
character takes two bytes.
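
The padding rule for the final slot can be sketched as follows. This is an
illustration only: slot_fill_chars() and SLOT_CHARS are hypothetical names,
the input is assumed to be plain ASCII, and the sketch works on 16-bit
units in memory without splitting them over the three name fields or
dealing with on-disc byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_CHARS 13	/* characters carried by one slot */

/*
 * Fill the 13 character positions of one slot from an ASCII string.
 * 'src' points at the first character for this slot and 'len' is the
 * number of characters left in the long name from that point on.
 */
void slot_fill_chars(uint16_t out[SLOT_CHARS], const char *src, size_t len)
{
	size_t i;

	assert(out != NULL && src != NULL);
	for (i = 0; i < SLOT_CHARS; i++) {
		if (i < len)
			out[i] = (unsigned char)src[i];	/* ASCII to 16-bit unit */
		else if (i == len)
			out[i] = 0x0000;		/* terminator after the name */
		else
			out[i] = 0xFFFF;		/* unused positions */
	}
}
```

For the last slot of the example name ("h is long", 9 characters) this
yields nine characters, one 0x0000 and three 0xFFFF units. A slot that is
completely full gets neither a terminator nor padding.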
|
671
Documentation/filesystems/vfs.txt
Normal file
@@ -0,0 +1,671 @@
|
||||
/* -*- auto-fill -*- */
|
||||
|
||||
Overview of the Virtual File System
|
||||
|
||||
Richard Gooch <rgooch@atnf.csiro.au>
|
||||
|
||||
5-JUL-1999
|
||||
|
||||
|
||||
Conventions used in this document <section>
|
||||
=================================
|
||||
|
||||
Each section in this document will have the string "<section>" at the
|
||||
right-hand side of the section title. Each subsection will have
|
||||
"<subsection>" at the right-hand side. These strings are meant to make
|
||||
it easier to search through the document.
|
||||
|
||||
NOTE that the master copy of this document is available online at:
|
||||
http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
|
||||
|
||||
|
||||
What is it? <section>
|
||||
===========
|
||||
|
||||
The Virtual File System (otherwise known as the Virtual Filesystem
|
||||
Switch) is the software layer in the kernel that provides the
|
||||
filesystem interface to userspace programs. It also provides an
|
||||
abstraction within the kernel which allows different filesystem
|
||||
implementations to co-exist.
|
||||
|
||||
|
||||
A Quick Look At How It Works <section>
|
||||
============================
|
||||
|
||||
In this section I'll briefly describe how things work, before
|
||||
launching into the details. I'll start with describing what happens
|
||||
when user programs open and manipulate files, and then look from the
|
||||
other view which is how a filesystem is supported and subsequently
|
||||
mounted.
|
||||
|
||||
Opening a File <subsection>
|
||||
--------------
|
||||
|
||||
The VFS implements the open(2), stat(2), chmod(2) and similar system
|
||||
calls. The pathname argument is used by the VFS to search through the
|
||||
directory entry cache (dentry cache or "dcache"). This provides a very
|
||||
fast look-up mechanism to translate a pathname (filename) into a
|
||||
specific dentry.
|
||||
|
||||
An individual dentry usually has a pointer to an inode. Inodes are the
|
||||
things that live on disc drives, and can be regular files (you know:
|
||||
those things that you write data into), directories, FIFOs and other
|
||||
beasts. Dentries live in RAM and are never saved to disc: they exist
|
||||
only for performance. Inodes live on disc and are copied into memory
|
||||
when required. Later any changes are written back to disc. The inode
|
||||
that lives in RAM is a VFS inode, and it is this which the dentry
|
||||
points to. A single inode can be pointed to by multiple dentries
|
||||
(think about hardlinks).
|
||||
|
||||
The dcache is meant to be a view into your entire filespace. Unlike
|
||||
Linus, most of us losers can't fit enough dentries into RAM to cover
|
||||
all of our filespace, so the dcache has bits missing. In order to
|
||||
resolve your pathname into a dentry, the VFS may have to resort to
|
||||
creating dentries along the way, and then loading the inode. This is
|
||||
done by looking up the inode.
|
||||
|
||||
To look up an inode (usually read from disc) requires that the VFS
|
||||
calls the lookup() method of the parent directory inode. This method
|
||||
is installed by the specific filesystem implementation that the inode
|
||||
lives in. There will be more on this later.
|
||||
|
||||
Once the VFS has the required dentry (and hence the inode), we can do
|
||||
all those boring things like open(2) the file, or stat(2) it to peek
|
||||
at the inode data. The stat(2) operation is fairly simple: once the
|
||||
VFS has the dentry, it peeks at the inode data and passes some of it
|
||||
back to userspace.
|
||||
|
||||
Opening a file requires another operation: allocation of a file
|
||||
structure (this is the kernel-side implementation of file
|
||||
descriptors). The freshly allocated file structure is initialised with
|
||||
a pointer to the dentry and a set of file operation member functions.
|
||||
These are taken from the inode data. The open() file method is then
|
||||
called so the specific filesystem implementation can do its work. You
|
||||
can see that this is another switch performed by the VFS.
|
||||
|
||||
The file structure is placed into the file descriptor table for the
|
||||
process.
|
||||
|
||||
Reading, writing and closing files (and other assorted VFS operations)
|
||||
are done by using the userspace file descriptor to grab the appropriate
|
||||
file structure, and then calling the required file structure method
|
||||
function to do whatever is required.
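
To make the "file descriptor to file structure to method" chain a little
more concrete, here is a toy userspace model in C. It is not kernel code:
every toy_* and memfs_* name is invented for this sketch, the descriptor
table is a plain array, and the "file" is just a string in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Per-filesystem method table, in the spirit of struct file_operations. */
struct toy_file_operations {
	long (*read)(struct toy_file *f, char *buf, size_t count);
};

/* Kernel-side object behind a file descriptor. */
struct toy_file {
	const struct toy_file_operations *f_op;
	const char *data;	/* stands in for the dentry/inode contents */
	size_t pos;		/* current file position */
};

#define TOY_MAX_FDS 8
static struct toy_file *toy_fd_table[TOY_MAX_FDS];	/* one table per process */

/* A "filesystem" read method: copy bytes from the in-memory data. */
static long memfs_read(struct toy_file *f, char *buf, size_t count)
{
	size_t left = strlen(f->data) - f->pos;

	if (count > left)
		count = left;
	memcpy(buf, f->data + f->pos, count);
	f->pos += count;
	return (long)count;
}

const struct toy_file_operations memfs_fops = { .read = memfs_read };

/* Place a file structure into the first free descriptor slot. */
int toy_install_fd(struct toy_file *f)
{
	int fd;

	assert(f != NULL);
	for (fd = 0; fd < TOY_MAX_FDS; fd++) {
		if (toy_fd_table[fd] == NULL) {
			toy_fd_table[fd] = f;
			return fd;
		}
	}
	return -1;	/* table full */
}

/* What a read(2)-like entry point does: fd -> file structure -> method. */
long toy_read(int fd, char *buf, size_t count)
{
	struct toy_file *f;

	if (fd < 0 || fd >= TOY_MAX_FDS || toy_fd_table[fd] == NULL)
		return -1;	/* bad descriptor */
	f = toy_fd_table[fd];
	if (f->f_op == NULL || f->f_op->read == NULL)
		return -1;	/* operation not supported */
	return f->f_op->read(f, buf, count);
}
```

The point of the indirection is that toy_read() never needs to know which
filesystem it is talking to; it only follows the method pointer stored in
the file structure.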
|
||||
|
||||
For as long as the file is open, it keeps the dentry "open" (in use),
|
||||
which in turn means that the VFS inode is still in use.
|
||||
|
||||
All VFS system calls (e.g. open(2), stat(2), read(2), write(2),
|
||||
chmod(2) and so on) are called from a process context. You should
|
||||
assume that these calls are made without any kernel locks being
|
||||
held. This means that the processes may be executing the same piece of
|
||||
filesystem or driver code at the same time, on different
|
||||
processors. You should ensure that access to shared resources is
|
||||
protected by appropriate locks.
|
||||
|
||||
Registering and Mounting a Filesystem <subsection>
|
||||
-------------------------------------
|
||||
|
||||
If you want to support a new kind of filesystem in the kernel, all you
|
||||
need to do is call register_filesystem(). You pass a structure
|
||||
describing the filesystem implementation (struct file_system_type)
|
||||
which is then added to an internal table of supported filesystems. You
|
||||
can do:
|
||||
|
||||
% cat /proc/filesystems
|
||||
|
||||
to see what filesystems are currently available on your system.
|
||||
|
||||
When a request is made to mount a block device onto a directory in
|
||||
your filespace the VFS will call the appropriate method for the
|
||||
specific filesystem. The dentry for the mount point will then be
|
||||
updated to point to the root inode for the new filesystem.
|
||||
|
||||
It's now time to look at things in more detail.
|
||||
|
||||
|
||||
struct file_system_type <section>
|
||||
=======================
|
||||
|
||||
This describes the filesystem. As of kernel 2.1.99, the following
|
||||
members are defined:
|
||||
|
||||
	struct file_system_type {
		const char *name;
		int fs_flags;
		struct super_block *(*read_super) (struct super_block *, void *, int);
		struct file_system_type * next;
	};
|
||||
|
||||
name: the name of the filesystem type, such as "ext2", "iso9660",
|
||||
"msdos" and so on
|
||||
|
||||
fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
|
||||
|
||||
read_super: the method to call when a new instance of this
|
||||
filesystem should be mounted
|
||||
|
||||
next: for internal VFS use: you should initialise this to NULL
|
||||
|
||||
The read_super() method has the following arguments:
|
||||
|
||||
struct super_block *sb: the superblock structure. This is partially
|
||||
initialised by the VFS and the rest must be initialised by the
|
||||
read_super() method
|
||||
|
||||
void *data: arbitrary mount options, usually comes as an ASCII
|
||||
string
|
||||
|
||||
int silent: whether or not to be silent on error
|
||||
|
||||
The read_super() method must determine if the block device specified
|
||||
in the superblock contains a filesystem of the type the method
|
||||
supports. On success the method returns the superblock pointer, on
|
||||
failure it returns NULL.
|
||||
|
||||
The most interesting member of the superblock structure that the
|
||||
read_super() method fills in is the "s_op" field. This is a pointer to
|
||||
a "struct super_operations" which describes the next level of the
|
||||
filesystem implementation.
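
The registration and mount steps described in this section can be modelled
in a few lines of userspace C. This is only a sketch under stated
assumptions: the toy_* and demofs_* names are hypothetical, the "block
device" is a string, and the return values of toy_register_filesystem()
(0 on success, -1 on failure) are chosen for the example rather than
copied from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_super_operations {
	const char *label;	/* placeholder for the real method pointers */
};

struct toy_super_block {
	const struct toy_super_operations *s_op;
	const char *s_dev;	/* stands in for the block device */
};

struct toy_fs_type {
	const char *name;
	struct toy_super_block *(*read_super)(struct toy_super_block *,
					      void *, int);
	struct toy_fs_type *next;	/* used by the registry only */
};

static struct toy_fs_type *toy_file_systems;	/* head of the table */

/* Add a filesystem type to the table; refuse duplicate names. */
int toy_register_filesystem(struct toy_fs_type *fs)
{
	struct toy_fs_type **p;

	assert(fs != NULL && fs->name != NULL);
	if (fs->next != NULL)
		return -1;	/* caller must initialise next to NULL */
	for (p = &toy_file_systems; *p != NULL; p = &(*p)->next)
		if (strcmp((*p)->name, fs->name) == 0)
			return -1;	/* already registered */
	*p = fs;
	return 0;
}

/* Find the type by name and let its read_super() inspect the "device". */
struct toy_super_block *toy_mount(const char *type, struct toy_super_block *sb,
				  void *data, int silent)
{
	struct toy_fs_type *fs;

	for (fs = toy_file_systems; fs != NULL; fs = fs->next)
		if (strcmp(fs->name, type) == 0)
			return fs->read_super(sb, data, silent);
	return NULL;	/* unknown filesystem type */
}

static const struct toy_super_operations demofs_sops = { "demofs" };

/* Example read_super(): accept only devices whose name starts with "demo". */
struct toy_super_block *demofs_read_super(struct toy_super_block *sb,
					  void *data, int silent)
{
	(void)data;
	(void)silent;
	if (strncmp(sb->s_dev, "demo", 4) != 0)
		return NULL;		/* not our filesystem */
	sb->s_op = &demofs_sops;	/* next level of the implementation */
	return sb;
}
```

A filesystem would fill in one struct toy_fs_type with next set to NULL,
pass it to toy_register_filesystem() once, and from then on toy_mount()
can hand superblocks to its read_super() method.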
|
||||
|
||||
|
||||
struct super_operations <section>
|
||||
=======================
|
||||
|
||||
This describes how the VFS can manipulate the superblock of your
|
||||
filesystem. As of kernel 2.1.99, the following members are defined:
|
||||
|
||||
	struct super_operations {
		void (*read_inode) (struct inode *);
		int (*write_inode) (struct inode *, int);
		void (*put_inode) (struct inode *);
		void (*drop_inode) (struct inode *);
		void (*delete_inode) (struct inode *);
		int (*notify_change) (struct dentry *, struct iattr *);
		void (*put_super) (struct super_block *);
		void (*write_super) (struct super_block *);
		int (*statfs) (struct super_block *, struct statfs *, int);
		int (*remount_fs) (struct super_block *, int *, char *);
		void (*clear_inode) (struct inode *);
	};
|
||||
|
||||
All methods are called without any locks being held, unless otherwise
|
||||
noted. This means that most methods can block safely. All methods are
|
||||
only called from a process context (i.e. not from an interrupt handler
|
||||
or bottom half).
|
||||
|
||||
read_inode: this method is called to read a specific inode from the
|
||||
mounted filesystem. The "i_ino" member in the "struct inode"
|
||||
will be initialised by the VFS to indicate which inode to
|
||||
read. Other members are filled in by this method
|
||||
|
||||
write_inode: this method is called when the VFS needs to write an
|
||||
inode to disc. The second parameter indicates whether the write
|
||||
should be synchronous or not; not all filesystems check this flag.
|
||||
|
||||
put_inode: called when the VFS inode is removed from the inode
|
||||
cache. This method is optional
|
||||
|
||||
drop_inode: called when the last access to the inode is dropped,
|
||||
with the inode_lock spinlock held.
|
||||
|
||||
This method should be either NULL (normal unix filesystem
|
||||
semantics) or "generic_delete_inode" (for filesystems that do not
|
||||
want to cache inodes - causing "delete_inode" to always be
|
||||
called regardless of the value of i_nlink)
|
||||
|
||||
The "generic_delete_inode()" behaviour is equivalent to the
|
||||
old practice of using "force_delete" in the put_inode() case,
|
||||
but does not have the races that the "force_delete()" approach
|
||||
had.
|
||||
|
||||
delete_inode: called when the VFS wants to delete an inode
|
||||
|
||||
notify_change: called when VFS inode attributes are changed. If this
|
||||
is NULL the VFS falls back to the write_inode() method. This
|
||||
is called with the kernel lock held
|
||||
|
||||
put_super: called when the VFS wishes to free the superblock
|
||||
(i.e. unmount). This is called with the superblock lock held
|
||||
|
||||
write_super: called when the VFS superblock needs to be written to
|
||||
disc. This method is optional
|
||||
|
||||
statfs: called when the VFS needs to get filesystem statistics. This
|
||||
is called with the kernel lock held
|
||||
|
||||
remount_fs: called when the filesystem is remounted. This is called
|
||||
with the kernel lock held
|
||||
|
||||
clear_inode: called when the VFS clears the inode. Optional
|
||||
|
||||
The read_inode() method is responsible for filling in the "i_op"
|
||||
field. This is a pointer to a "struct inode_operations" which
|
||||
describes the methods that can be performed on individual inodes.
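
The division of labour around read_inode() might look roughly like this in
a userspace model. All names here (toy_iget(), demofs_read_inode() and the
toy_* structures) are hypothetical, and the rule that inode number 1 is the
root directory is an assumption made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;

struct toy_inode_operations {
	const char *kind;	/* placeholder for the real method pointers */
};

struct toy_super_operations {
	void (*read_inode)(struct toy_inode *);
};

struct toy_super_block {
	const struct toy_super_operations *s_op;
};

struct toy_inode {
	unsigned long i_ino;
	int i_is_dir;
	const struct toy_inode_operations *i_op;
	struct toy_super_block *i_sb;
};

static const struct toy_inode_operations demofs_dir_iops = { "dir" };
static const struct toy_inode_operations demofs_file_iops = { "file" };

/* Filesystem side: fill in everything except i_ino, including i_op. */
void demofs_read_inode(struct toy_inode *inode)
{
	/* hypothetical on-disc rule: inode 1 is the root directory */
	inode->i_is_dir = (inode->i_ino == 1);
	inode->i_op = inode->i_is_dir ? &demofs_dir_iops : &demofs_file_iops;
}

const struct toy_super_operations demofs_sops = {
	.read_inode = demofs_read_inode,
};

/* VFS side: set i_ino, then switch to the filesystem's method. */
int toy_iget(struct toy_super_block *sb, unsigned long ino,
	     struct toy_inode *inode)
{
	assert(sb != NULL && inode != NULL);
	if (sb->s_op == NULL || sb->s_op->read_inode == NULL)
		return -1;
	inode->i_ino = ino;
	inode->i_sb = sb;
	inode->i_op = NULL;
	sb->s_op->read_inode(inode);
	return inode->i_op != NULL ? 0 : -1;
}
```

Choosing i_op per inode type, as demofs_read_inode() does, is what lets
directories and regular files on the same filesystem answer to different
sets of methods.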
|
||||
|
||||
|
||||
struct inode_operations <section>
|
||||
=======================
|
||||
|
||||
This describes how the VFS can manipulate an inode in your
|
||||
filesystem. As of kernel 2.1.99, the following members are defined:
|
||||
|
||||
	struct inode_operations {
		struct file_operations * default_file_ops;
		int (*create) (struct inode *,struct dentry *,int);
		int (*lookup) (struct inode *,struct dentry *);
		int (*link) (struct dentry *,struct inode *,struct dentry *);
		int (*unlink) (struct inode *,struct dentry *);
		int (*symlink) (struct inode *,struct dentry *,const char *);
		int (*mkdir) (struct inode *,struct dentry *,int);
		int (*rmdir) (struct inode *,struct dentry *);
		int (*mknod) (struct inode *,struct dentry *,int,dev_t);
		int (*rename) (struct inode *, struct dentry *,
				struct inode *, struct dentry *);
		int (*readlink) (struct dentry *, char *,int);
		struct dentry * (*follow_link) (struct dentry *, struct dentry *);
		int (*readpage) (struct file *, struct page *);
		int (*writepage) (struct page *page, struct writeback_control *wbc);
		int (*bmap) (struct inode *,int);
		void (*truncate) (struct inode *);
		int (*permission) (struct inode *, int);
		int (*smap) (struct inode *,int);
		int (*updatepage) (struct file *, struct page *, const char *,
					unsigned long, unsigned int, int);
		int (*revalidate) (struct dentry *);
	};
|
||||
|
||||
Again, all methods are called without any locks being held, unless
|
||||
otherwise noted.
|
||||
|
||||
default_file_ops: this is a pointer to a "struct file_operations"
|
||||
which describes how to open and then manipulate open files
|
||||
|
||||
create: called by the open(2) and creat(2) system calls. Only
|
||||
required if you want to support regular files. The dentry you
|
||||
get should not have an inode (i.e. it should be a negative
|
||||
dentry). Here you will probably call d_instantiate() with the
|
||||
dentry and the newly created inode
|
||||
|
||||
lookup: called when the VFS needs to look up an inode in a parent
|
||||
directory. The name to look for is found in the dentry. This
|
||||
method must call d_add() to insert the found inode into the
|
||||
dentry. The "i_count" field in the inode structure should be
|
||||
incremented. If the named inode does not exist a NULL inode
|
||||
should be inserted into the dentry (this is called a negative
|
||||
dentry). Returning an error code from this routine must only
|
||||
be done on a real error, otherwise creating inodes with system
|
||||
calls like creat(2), mknod(2), mkdir(2) and so on will fail.
|
||||
If you wish to overload the dentry methods then you should
|
||||
initialise the "d_op" field in the dentry; this is a pointer
|
||||
to a struct "dentry_operations".
|
||||
This method is called with the directory inode semaphore held
|
||||
|
||||
link: called by the link(2) system call. Only required if you want
|
||||
to support hard links. You will probably need to call
|
||||
d_instantiate() just as you would in the create() method
|
||||
|
||||
unlink: called by the unlink(2) system call. Only required if you
|
||||
want to support deleting inodes
|
||||
|
||||
symlink: called by the symlink(2) system call. Only required if you
|
||||
want to support symlinks. You will probably need to call
|
||||
d_instantiate() just as you would in the create() method
|
||||
|
||||
mkdir: called by the mkdir(2) system call. Only required if you want
|
||||
to support creating subdirectories. You will probably need to
|
||||
call d_instantiate() just as you would in the create() method
|
||||
|
||||
rmdir: called by the rmdir(2) system call. Only required if you want
|
||||
to support deleting subdirectories
|
||||
|
||||
mknod: called by the mknod(2) system call to create a device (char,
|
||||
block) inode or a named pipe (FIFO) or socket. Only required
|
||||
if you want to support creating these types of inodes. You
|
||||
will probably need to call d_instantiate() just as you would
|
||||
in the create() method
|
||||
|
||||
readlink: called by the readlink(2) system call. Only required if
|
||||
you want to support reading symbolic links
|
||||
|
||||
follow_link: called by the VFS to follow a symbolic link to the
|
||||
inode it points to. Only required if you want to support
|
||||
symbolic links
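
The lookup() contract described above (insert the inode on a hit, insert a
NULL inode on a miss, and only return an error for a real failure) can be
sketched in userspace C as follows. The toy_* and demofs_* names and the
two-entry "directory" are invented for this illustration, and toy_d_add()
is a drastically simplified stand-in for the real d_add().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_inode {
	unsigned long i_ino;
	int i_count;
};

struct toy_dentry {
	const char *d_name;
	struct toy_inode *d_inode;	/* NULL means negative dentry */
	int d_hashed;
};

/* Toy d_add(): hash the dentry and attach the (possibly NULL) inode. */
void toy_d_add(struct toy_dentry *dentry, struct toy_inode *inode)
{
	dentry->d_inode = inode;
	dentry->d_hashed = 1;
}

/* A hypothetical directory with two entries. */
static struct toy_inode demo_inodes[] = { { 10, 0 }, { 11, 0 } };
static const char *demo_names[] = { "README", "Makefile" };

/* Filesystem lookup(): the name to find is taken from the dentry. */
int demofs_lookup(struct toy_inode *dir, struct toy_dentry *dentry)
{
	size_t i;

	assert(dir != NULL && dentry != NULL && dentry->d_name != NULL);
	for (i = 0; i < sizeof(demo_names) / sizeof(demo_names[0]); i++) {
		if (strcmp(demo_names[i], dentry->d_name) == 0) {
			demo_inodes[i].i_count++;	/* dentry holds a reference */
			toy_d_add(dentry, &demo_inodes[i]);
			return 0;
		}
	}
	toy_d_add(dentry, NULL);	/* negative dentry: a miss is not an error */
	return 0;
}
```

Returning 0 for a missing name matters because a later create() or mkdir()
on the same name starts from exactly that negative dentry.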
|
||||
|
||||
|
||||
struct file_operations <section>
|
||||
======================
|
||||
|
||||
This describes how the VFS can manipulate an open file. As of kernel
|
||||
2.1.99, the following members are defined:
|
||||
|
||||
	struct file_operations {
		loff_t (*llseek) (struct file *, loff_t, int);
		ssize_t (*read) (struct file *, char *, size_t, loff_t *);
		ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
		int (*readdir) (struct file *, void *, filldir_t);
		unsigned int (*poll) (struct file *, struct poll_table_struct *);
		int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
		int (*mmap) (struct file *, struct vm_area_struct *);
		int (*open) (struct inode *, struct file *);
		int (*release) (struct inode *, struct file *);
		int (*fsync) (struct file *, struct dentry *);
		int (*fasync) (struct file *, int);
		int (*check_media_change) (kdev_t dev);
		int (*revalidate) (kdev_t dev);
		int (*lock) (struct file *, int, struct file_lock *);
	};
|
||||
|
||||
Again, all methods are called without any locks being held, unless
|
||||
otherwise noted.
|
||||
|
||||
llseek: called when the VFS needs to move the file position index
|
||||
|
||||
read: called by read(2) and related system calls
|
||||
|
||||
write: called by write(2) and related system calls
|
||||
|
||||
readdir: called when the VFS needs to read the directory contents
|
||||
|
||||
poll: called by the VFS when a process wants to check if there is
|
||||
activity on this file and (optionally) go to sleep until there
|
||||
is activity. Called by the select(2) and poll(2) system calls
|
||||
|
||||
ioctl: called by the ioctl(2) system call
|
||||
|
||||
mmap: called by the mmap(2) system call
|
||||
|
||||
open: called by the VFS when an inode should be opened. When the VFS
|
||||
opens a file, it creates a new "struct file" and initialises
|
||||
the "f_op" file operations member with the "default_file_ops"
|
||||
field in the inode structure. It then calls the open method
|
||||
for the newly allocated file structure. You might think that
|
||||
the open method really belongs in "struct inode_operations",
|
||||
and you may be right. I think it's done the way it is because
|
||||
it makes filesystems simpler to implement. The open() method
|
||||
is a good place to initialise the "private_data" member in the
|
||||
file structure if you want to point to a device structure
|
||||
|
||||
release: called when the last reference to an open file is closed
|
||||
|
||||
fsync: called by the fsync(2) system call
|
||||
|
||||
fasync: called by the fcntl(2) system call when asynchronous
|
||||
(non-blocking) mode is enabled for a file
|
||||
|
||||
Note that the file operations are implemented by the specific
|
||||
filesystem in which the inode resides. When opening a device node
|
||||
(character or block special) most filesystems will call special
|
||||
support routines in the VFS which will locate the required device
|
||||
driver information. These support routines replace the filesystem file
|
||||
operations with those for the device driver, and then proceed to call
|
||||
the new open() method for the file. This is how opening a device file
|
||||
in the filesystem eventually ends up calling the device driver open()
|
||||
method. Note the devfs (the Device FileSystem) has a more direct path
|
||||
from device node to device driver (this is an unofficial kernel
|
||||
patch).
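
The hand-over from filesystem file operations to device driver file
operations described above can be pictured with this userspace sketch. It
is a model only: the toy_*, demofs_* and demodrv_* names, the small driver
table indexed by a device number, and the -1 error value are all
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;
struct toy_file;

struct toy_file_operations {
	int (*open)(struct toy_inode *, struct toy_file *);
};

struct toy_inode {
	const struct toy_file_operations *default_file_ops;
	int i_rdev;	/* device number of a device node */
};

struct toy_file {
	const struct toy_file_operations *f_op;
	void *private_data;
};

/* Device driver side. */
static int demo_device;	/* stands in for a per-device structure */

static int demodrv_open(struct toy_inode *inode, struct toy_file *file)
{
	(void)inode;
	file->private_data = &demo_device;	/* point at the device structure */
	return 0;
}

static const struct toy_file_operations demodrv_fops = { demodrv_open };

/* Hypothetical table of registered drivers, indexed by device number. */
#define TOY_MAX_CHRDEV 4
static const struct toy_file_operations *toy_chrdevs[TOY_MAX_CHRDEV] = {
	NULL, &demodrv_fops,
};

/* Support routine: swap in the driver's methods, then call its open(). */
int toy_chrdev_open(struct toy_inode *inode, struct toy_file *file)
{
	const struct toy_file_operations *drv;

	if (inode->i_rdev < 0 || inode->i_rdev >= TOY_MAX_CHRDEV)
		return -1;
	drv = toy_chrdevs[inode->i_rdev];
	if (drv == NULL)
		return -1;	/* no driver for this device number */
	file->f_op = drv;
	return drv->open != NULL ? drv->open(inode, file) : 0;
}

/* What the filesystem installs as default_file_ops for device nodes. */
const struct toy_file_operations demofs_chrdev_fops = { toy_chrdev_open };

/* Generic open: start from the inode's default file operations. */
int toy_open(struct toy_inode *inode, struct toy_file *file)
{
	assert(inode != NULL && file != NULL);
	file->f_op = inode->default_file_ops;
	file->private_data = NULL;
	if (file->f_op != NULL && file->f_op->open != NULL)
		return file->f_op->open(inode, file);
	return 0;
}
```

After toy_open() returns, the file structure no longer refers to the
filesystem's methods at all, so later reads and writes go straight to the
driver.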
|
||||
|
||||
|
||||
Directory Entry Cache (dcache) <section>
|
||||
------------------------------
|
||||
|
||||
struct dentry_operations
|
||||
========================
|
||||
|
||||
This describes how a filesystem can overload the standard dentry
|
||||
operations. Dentries and the dcache are the domain of the VFS and the
|
||||
individual filesystem implementations. Device drivers have no business
|
||||
here. These methods may be set to NULL, as they are either optional or
|
||||
the VFS uses a default. As of kernel 2.1.99, the following members are
|
||||
defined:
|
||||
|
||||
	struct dentry_operations {
		int (*d_revalidate)(struct dentry *);
		int (*d_hash) (struct dentry *, struct qstr *);
		int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
		void (*d_delete)(struct dentry *);
		void (*d_release)(struct dentry *);
		void (*d_iput)(struct dentry *, struct inode *);
	};
|
||||
|
||||
d_revalidate: called when the VFS needs to revalidate a dentry. This
|
||||
is called whenever a name look-up finds a dentry in the
|
||||
dcache. Most filesystems leave this as NULL, because all their
|
||||
dentries in the dcache are valid
|
||||
|
||||
d_hash: called when the VFS adds a dentry to the hash table
|
||||
|
||||
d_compare: called when a dentry should be compared with another
|
||||
|
||||
d_delete: called when the last reference to a dentry is
|
||||
deleted. This means no-one is using the dentry, however it is
|
||||
still valid and in the dcache
|
||||
|
||||
d_release: called when a dentry is really deallocated
|
||||
|
||||
d_iput: called when a dentry loses its inode (just prior to its
|
||||
being deallocated). The default when this is NULL is that the
|
||||
VFS calls iput(). If you define this method, you must call
|
||||
iput() yourself
|
||||
|
||||
Each dentry has a pointer to its parent dentry, as well as a hash list
|
||||
of child dentries. Child dentries are basically like files in a
|
||||
directory.
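
One common reason to overload d_hash and d_compare is a filesystem with
case-insensitive names. The sketch below shows the idea in userspace C. It
is an illustration under assumptions: struct toy_qstr, ci_d_hash() and
ci_d_compare() are made-up names, the dentry argument of the real methods
is left out, the hash function is arbitrary, and the convention that the
compare method returns 0 on a match (like memcmp()) is assumed rather than
stated in this document.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Simplified stand-in for struct qstr: a name, its length and its hash. */
struct toy_qstr {
	const char *name;
	size_t len;
	unsigned long hash;
};

/* Hash the lower-cased name so that "README" and "ReadMe" collide on purpose. */
int ci_d_hash(struct toy_qstr *q)
{
	unsigned long h = 0;
	size_t i;

	assert(q != NULL && q->name != NULL);
	for (i = 0; i < q->len; i++)
		h = h * 31 + (unsigned long)tolower((unsigned char)q->name[i]);
	q->hash = h;
	return 0;
}

/* Compare two names ignoring case; 0 means "same name". */
int ci_d_compare(const struct toy_qstr *a, const struct toy_qstr *b)
{
	size_t i;

	assert(a != NULL && b != NULL);
	if (a->len != b->len)
		return 1;
	for (i = 0; i < a->len; i++)
		if (tolower((unsigned char)a->name[i]) !=
		    tolower((unsigned char)b->name[i]))
			return 1;
	return 0;
}
```

The two methods have to agree with each other: names that compare equal
must also hash to the same value, otherwise a look-up would search the
wrong hash chain and never reach the compare step.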
|
||||
|
||||
Directory Entry Cache APIs
|
||||
--------------------------
|
||||
|
||||
There are a number of functions defined which permit a filesystem to
|
||||
manipulate dentries:
|
||||
|
||||
dget: open a new handle for an existing dentry (this just increments
|
||||
the usage count)
|
||||
|
||||
dput: close a handle for a dentry (decrements the usage count). If
|
||||
the usage count drops to 0, the "d_delete" method is called
|
||||
and the dentry is placed on the unused list if the dentry is
|
||||
still in its parent's hash list. Putting the dentry on the
|
||||
unused list just means that if the system needs some RAM, it
|
||||
goes through the unused list of dentries and deallocates them.
|
||||
If the dentry has already been unhashed and the usage count
|
||||
drops to 0, the dentry is deallocated after the
|
||||
"d_delete" method is called
|
||||
|
||||
d_drop: this unhashes a dentry from its parent's hash list. A
|
||||
subsequent call to dput() will deallocate the dentry if its
|
||||
usage count drops to 0
|
||||
|
||||
d_delete: delete a dentry. If there are no other open references to
|
||||
the dentry then the dentry is turned into a negative dentry
|
||||
(the d_iput() method is called). If there are other
|
||||
references, then d_drop() is called instead
|
||||
|
||||
d_add: adds a dentry to its parent's hash list and then calls
|
||||
d_instantiate()
|
||||
|
||||
d_instantiate: adds a dentry to the alias hash list for the inode and
|
||||
updates the "d_inode" member. The "i_count" member in the
|
||||
inode structure should be set/incremented. If the inode
|
||||
pointer is NULL, the dentry is called a "negative
|
||||
dentry". This function is commonly called when an inode is
|
||||
created for an existing negative dentry
|
||||
|
||||
d_lookup: look up a dentry given its parent and path name component.
|
||||
It looks up the child of that given name from the dcache
|
||||
hash table. If it is found, the reference count is incremented
|
||||
and the dentry is returned. The caller must use dput()
|
||||
to free the dentry when it finishes using it.
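
The usage count rules for dget(), dput() and d_drop() can be followed more
easily with a small state model. This is a userspace sketch, not the
kernel implementation: the toy_* names and the three-state enum are
assumptions, nothing is really freed, and the d_delete method is reduced
to a comment.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state {
	TOY_IN_USE,		/* usage count > 0 */
	TOY_UNUSED_LIST,	/* count is 0, still hashed, may be reclaimed */
	TOY_FREED		/* count is 0 and unhashed: deallocated */
};

struct toy_dentry {
	int d_count;
	int d_hashed;
	enum toy_state state;
};

/* Open a new handle: just increments the usage count. */
struct toy_dentry *toy_dget(struct toy_dentry *d)
{
	assert(d != NULL && d->state != TOY_FREED);
	d->d_count++;
	d->state = TOY_IN_USE;
	return d;
}

/* Unhash the dentry from its parent's hash list. */
void toy_d_drop(struct toy_dentry *d)
{
	assert(d != NULL);
	d->d_hashed = 0;
}

/* Close a handle; decide the dentry's fate when the count reaches 0. */
void toy_dput(struct toy_dentry *d)
{
	assert(d != NULL && d->d_count > 0);
	if (--d->d_count > 0)
		return;
	/* the "d_delete" method would be called at this point */
	d->state = d->d_hashed ? TOY_UNUSED_LIST : TOY_FREED;
}
```

A dentry that is still hashed when its last handle goes away stays
findable on the unused list, while a d_drop() before the final dput()
sends it straight to deallocation.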
|
||||
|
||||
|
||||
RCU-based dcache locking model
|
||||
------------------------------
|
||||
|
||||
On many workloads, the most common operation on dcache is
|
||||
to look up a dentry, given a parent dentry and the name
|
||||
of the child. Typically, for every open(), stat() etc.,
|
||||
the dentry corresponding to the pathname will be looked
|
||||
up by walking the tree starting with the first component
|
||||
of the pathname and using that dentry along with the next
|
||||
component to look up the next level and so on. Since it
|
||||
is a frequent operation for workloads like multiuser
|
||||
environments and webservers, it is important to optimize
|
||||
this path.
|
||||
|
||||
Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
|
||||
for every component during path look-up. From 2.5.10 onwards, the
|
||||
fastwalk algorithm changed this by holding the dcache_lock
|
||||
at the beginning and walking as many cached path component
|
||||
dentries as possible. This significantly decreases the number
|
||||
of acquisitions of dcache_lock. However, it also increases the
|
||||
lock hold time significantly and affects performance in large
|
||||
SMP machines. Since the 2.5.62 kernel, dcache has been using
|
||||
a new locking model that uses RCU to make dcache look-up
|
||||
lock-free.
|
||||
|
||||
The current dcache locking model is not very different from the previous
|
||||
dcache locking model. Prior to the 2.5.62 kernel, dcache_lock
|
||||
protected the hash chain, d_child, d_alias, d_lru lists as well
|
||||
as d_inode and several other things like mount look-up. RCU-based
|
||||
changes affect only the way the hash chain is protected. For everything
|
||||
else the dcache_lock must be taken for both traversing as well as
|
||||
updating. Hash chain updates also take the dcache_lock.
|
||||
The significant change is the way d_lookup traverses the hash chain:
|
||||
it doesn't acquire the dcache_lock for this and relies on RCU to
|
||||
ensure that the dentry has not been *freed*.
|
||||
|
||||
|
||||
Dcache locking details
|
||||
----------------------
|
||||
For many multi-user workloads, open() and stat() on files are
|
||||
very frequently occurring operations. Both involve walking
|
||||
of path names to find the dentry corresponding to the
|
||||
concerned file. In the 2.4 kernel, dcache_lock was held
|
||||
during look-up of each path component. Contention and
|
||||
cacheline bouncing of this global lock caused significant
|
||||
scalability problems. With the introduction of RCU
|
||||
in the Linux kernel, this was worked around by making
|
||||
the look-up of path components during path walking lock-free.
|
||||
|
||||
|
||||
Safe lock-free look-up of dcache hash table
|
||||
===========================================
|
||||
|
||||
Dcache is a complex data structure with the hash table entries
|
||||
also linked together in other lists. In the 2.4 kernel, dcache_lock
|
||||
protected all the lists. We applied RCU only on hash chain
|
||||
walking. The rest of the lists are still protected by dcache_lock.
|
||||
Some of the important changes are:
|
||||
|
||||
1. The deletion from hash chain is done using hlist_del_rcu() macro which
|
||||
doesn't initialize the next pointer of the deleted dentry and this
|
||||
allows us to walk safely lock-free while a deletion is happening.
|
||||
|
||||
2. Insertion of a dentry into the hash table is done using
|
||||
hlist_add_head_rcu() which takes care of ordering the writes -
|
||||
the writes to the dentry must be visible before the dentry
|
||||
is inserted. This works in conjunction with hlist_for_each_rcu()
|
||||
while walking the hash chain. The only requirement is that
|
||||
all initialization to the dentry must be done before hlist_add_head_rcu()
|
||||
since we don't have dcache_lock protection while traversing
|
||||
the hash chain. This isn't different from the existing code.
|
||||
|
||||
3. The dentry looked up without holding dcache_lock cannot be
|
||||
returned for walking if it is unhashed. It then may have a NULL
|
||||
d_inode or other bogosity since RCU doesn't protect the other
|
||||
fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
|
||||
indicate unhashed dentries and use this in conjunction with a
|
||||
per-dentry lock (d_lock). Once looked up without the dcache_lock,
|
||||
we acquire the per-dentry lock (d_lock) and check if the
|
||||
dentry is unhashed. If so, the look-up is failed. If not, the
|
||||
reference count of the dentry is increased and the dentry is returned.
|
||||
|
||||
4. Once a dentry is looked up, it must be ensured during the path
|
||||
walk for that component it doesn't go away. In pre-2.5.10 code,
|
||||
this was done by holding a reference to the dentry. dcache_rcu does
|
||||
the same. In some sense, dcache_rcu path walking looks like
|
||||
the pre-2.5.10 version.
|
||||
|
||||
5. All dentry hash chain updates must take the dcache_lock as well as
|
||||
the per-dentry lock in that order. dput() does this to ensure
|
||||
that a dentry that has just been looked up on another CPU
|
||||
doesn't get deleted before dget() can be done on it.
|
||||
|
||||
6. There are several ways to do reference counting of RCU protected
|
||||
objects. One such example is in the IPv4 route cache where
|
||||
deferred freeing (using call_rcu()) is done as soon as
|
||||
the reference count goes to zero. This cannot be done in
|
||||
the case of dentries because tearing down of dentries
|
||||
requires blocking (dentry_iput()) which isn't supported from
|
||||
RCU callbacks. Instead, tearing down of dentries happens
|
||||
synchronously in dput(), but actual freeing happens later
|
||||
when the RCU grace period is over. This allows safe lock-free
|
||||
walking of the hash chains, but a matched dentry may have
|
||||
been partially torn down. The checking of DCACHE_UNHASHED
|
||||
flag with d_lock held detects such dentries and prevents
|
||||
them from being returned from look-up.
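
Points 3 to 6 above boil down to a particular order of checks inside the
look-up. The following single-threaded userspace model walks through that
order. It is a sketch under stated assumptions, not the kernel's code: all
toy_* names are invented, d_lock is an integer that only records whether
the lock is "held", the hash chain is a plain singly linked list, and there
is no real RCU read-side section.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_DCACHE_UNHASHED 0x1

struct toy_dentry {
	struct toy_dentry *next;	/* hash chain link, walked without dcache_lock */
	struct toy_dentry *d_parent;
	const char *d_name;
	unsigned int d_flags;		/* protected by d_lock */
	int d_count;			/* protected by d_lock while hashed */
	int d_lock;			/* 1 while held; stands in for a spinlock */
};

static void toy_lock(struct toy_dentry *d)
{
	assert(d->d_lock == 0);
	d->d_lock = 1;
}

static void toy_unlock(struct toy_dentry *d)
{
	assert(d->d_lock == 1);
	d->d_lock = 0;
}

/* Walk one hash chain looking for the child 'name' of 'parent'. */
struct toy_dentry *toy_lookup(struct toy_dentry *chain,
			      struct toy_dentry *parent, const char *name)
{
	struct toy_dentry *d, *found = NULL;

	/* a real implementation would be inside an RCU read-side section here */
	for (d = chain; d != NULL; d = d->next) {
		if (d->d_parent != parent)
			continue;		/* cheap check without the lock */
		toy_lock(d);
		/* recheck the parent and compare the name under d_lock */
		if (d->d_parent == parent && strcmp(d->d_name, name) == 0) {
			if (!(d->d_flags & TOY_DCACHE_UNHASHED)) {
				d->d_count++;	/* take a reference */
				found = d;
			}
			toy_unlock(d);
			break;			/* unhashed match: look-up fails */
		}
		toy_unlock(d);
	}
	return found;
}
```

The interesting case is a dentry whose name matches but which carries the
unhashed flag: it may already be partially torn down, so the look-up
reports a miss instead of handing it out.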
|
||||
|
||||
|
||||
Maintaining POSIX rename semantics
|
||||
==================================
|
||||
|
||||
Since look-up of dentries is lock-free, it can race against
|
||||
a concurrent rename operation. For example, during rename
|
||||
of file A to B, look-up of either A or B must succeed.
|
||||
So, if look-up of B happens after A has been removed from the
|
||||
hash chain but not added to the new hash chain, it may fail.
|
||||
Also, a comparison while the name is being written concurrently
|
||||
by a rename may result in false positive matches violating
|
||||
rename semantics. Issues related to race with rename are
|
||||
handled as described below:
|
||||
|
||||
1. Look-up can be done in two ways - d_lookup() which is safe
|
||||
from simultaneous renames and __d_lookup() which is not.
|
||||
If __d_lookup() fails, it must be followed up by a d_lookup()
|
||||
to correctly determine whether a dentry is in the hash table
|
||||
or not. d_lookup() protects look-ups using a sequence
|
||||
lock (rename_lock).
|
||||
|
||||
2. The name associated with a dentry (d_name) may be changed if
|
||||
a rename is allowed to happen simultaneously. To avoid memcmp()
|
||||
in __d_lookup() going out of bounds due to a rename and a false
|
||||
positive comparison, the name comparison is done while holding the
|
||||
per-dentry lock. This prevents concurrent renames during this
|
||||
operation.
|
||||
|
||||
3. Hash table walking during look-up may move to a different bucket as
|
||||
the current dentry is moved to a different bucket due to rename.
|
||||
But we use hlists in dcache hash table and they are null-terminated.
|
||||
So, even if a dentry moves to a different bucket, hash chain
|
||||
walk will terminate. [with a list_head list, it may not since
|
||||
termination is when the list_head in the original bucket is reached].
|
||||
Since we redo the d_parent check and compare name while holding
|
||||
d_lock, lock-free look-up will not race against d_move().
|
||||
|
||||
4. There can be a theoretical race when a dentry keeps coming back
|
||||
to original bucket due to double moves. Due to this look-up may
|
||||
consider that it has never moved and can end up in an infinite loop.
|
||||
But this is not any worse than the theoretical livelocks we already
|
||||
have in the kernel.
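
The retry scheme from item 1 can be sketched in userspace C. This is
only an illustration under stated assumptions: struct seq_sketch,
read_begin(), read_retry(), safe_lookup() and the demo callback are
hypothetical names, not the kernel's seqlock or dcache API. The idea
shown is that a miss from the lock-free path is trusted only if no
rename ran while the look-up was in progress.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for rename_lock; not the kernel API. */
struct seq_sketch {
	atomic_uint sequence;	/* odd while a writer (rename) is active */
};

static unsigned read_begin(struct seq_sketch *s)
{
	unsigned seq;

	/* Spin until no writer is active, then remember the sequence. */
	while ((seq = atomic_load(&s->sequence)) & 1)
		;
	return seq;
}

static bool read_retry(struct seq_sketch *s, unsigned seq)
{
	/* True if a writer ran since read_begin(): the miss is suspect. */
	return atomic_load(&s->sequence) != seq;
}

static void write_begin(struct seq_sketch *s) { atomic_fetch_add(&s->sequence, 1); }
static void write_end(struct seq_sketch *s)   { atomic_fetch_add(&s->sequence, 1); }

/* Hypothetical lock-free look-up callback; returns NULL on a miss. */
typedef void *(*fast_lookup_fn)(void *arg);

/* Repeat the lock-free look-up until it hits or no rename intervened. */
static void *safe_lookup(struct seq_sketch *s, fast_lookup_fn fast, void *arg)
{
	void *found;
	unsigned seq;

	do {
		seq = read_begin(s);
		found = fast(arg);
		if (found)
			break;	/* a hit needs no retry */
	} while (read_retry(s, seq));
	return found;
}

/* Demo state: the first call misses because a rename slips in. */
struct demo {
	struct seq_sketch *s;
	int calls;
	int value;
};

static void *demo_fast(void *arg)
{
	struct demo *d = arg;

	if (d->calls++ == 0) {
		write_begin(d->s);	/* simulated concurrent rename */
		write_end(d->s);
		return NULL;
	}
	return &d->value;
}
```

With the demo callback, the first pass misses, the sequence check
notices the simulated rename, and the second pass finds the entry.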
|
||||
|
||||
|
||||
Important guidelines for filesystem developers related to dcache_rcu
|
||||
====================================================================
|
||||
|
||||
1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
|
||||
don't change. Only dcache internal implementation changes. However
|
||||
filesystems *must not* delete from the dentry hash chains directly
|
||||
using the list macros as was allowed earlier. They must use dcache
|
||||
APIs like d_drop() or __d_drop() depending on the situation.
|
||||
|
||||
2. d_flags is now protected by a per-dentry lock (d_lock). All
|
||||
access to d_flags must be protected by it.
|
||||
|
||||
3. For a hashed dentry, checking of d_count needs to be protected
|
||||
by d_lock.
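
Guidelines 2 and 3 amount to "take the per-dentry lock before looking
at d_flags or d_count". A minimal userspace sketch follows. The
structure, the SKETCH_UNHASHED value and the helper names are
hypothetical, and a C11 atomic_flag spin loop stands in for the
kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_UNHASHED 0x1u	/* hypothetical flag value for illustration */

/* Hypothetical stand-in for the fields of a dentry discussed above. */
struct dentry_sketch {
	atomic_flag d_lock;	/* stand-in for the per-dentry spinlock */
	unsigned int d_flags;
	int d_count;
};

static void sketch_lock(struct dentry_sketch *d)
{
	while (atomic_flag_test_and_set(&d->d_lock))
		;	/* spin until the lock is free */
}

static void sketch_unlock(struct dentry_sketch *d)
{
	atomic_flag_clear(&d->d_lock);
}

/* Read d_flags only with d_lock held. */
static bool sketch_unhashed(struct dentry_sketch *d)
{
	bool unhashed;

	sketch_lock(d);
	unhashed = (d->d_flags & SKETCH_UNHASHED) != 0;
	sketch_unlock(d);
	return unhashed;
}

/* Check d_count only with d_lock held. */
static bool sketch_in_use(struct dentry_sketch *d)
{
	bool busy;

	sketch_lock(d);
	busy = d->d_count > 0;
	sketch_unlock(d);
	return busy;
}

/* Modify d_flags only with d_lock held. */
static void sketch_mark_unhashed(struct dentry_sketch *d)
{
	sketch_lock(d);
	d->d_flags |= SKETCH_UNHASHED;
	sketch_unlock(d);
}
```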
|
||||
|
||||
|
||||
Papers and other documentation on dcache locking
|
||||
================================================
|
||||
|
||||
1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
|
||||
|
||||
2. http://lse.sourceforge.net/locking/dcache/dcache.html
|
188
Documentation/filesystems/xfs.txt
Normal file
@@ -0,0 +1,188 @@
|
||||
|
||||
The SGI XFS Filesystem
|
||||
======================
|
||||
|
||||
XFS is a high performance journaling filesystem which originated
|
||||
on the SGI IRIX platform. It is completely multi-threaded, can
|
||||
support large files and large filesystems, extended attributes,
|
||||
variable block sizes, is extent based, and makes extensive use of
|
||||
Btrees (directories, extents, free space) to aid both performance
|
||||
and scalability.
|
||||
|
||||
Refer to the documentation at http://oss.sgi.com/projects/xfs/
|
||||
for further details. This implementation is on-disk compatible
|
||||
with the IRIX version of XFS.
|
||||
|
||||
|
||||
Mount Options
|
||||
=============
|
||||
|
||||
When mounting an XFS filesystem, the following options are accepted.
|
||||
|
||||
biosize=size
|
||||
Sets the preferred buffered I/O size (default size is 64K).
|
||||
"size" must be expressed as the logarithm (base2) of the
|
||||
desired I/O size.
|
||||
Valid values for this option are 14 through 16, inclusive
|
||||
(i.e. 16K, 32K, and 64K bytes). On machines with a 4K
|
||||
pagesize, 13 (8K bytes) is also a valid size.
|
||||
The preferred buffered I/O size can also be altered on an
|
||||
individual file basis using the ioctl(2) system call.
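
Because biosize takes a base-2 logarithm rather than a byte count, a
small conversion helper can be handy. The sketch below is not part of
XFS; biosize_for_bytes() is a hypothetical name, and the accepted
range simply encodes the limits stated above (14 to 16, plus 13 when
the page size is 4K).

```c
#include <assert.h>

/*
 * Hypothetical helper: return the biosize= value for an I/O size given
 * in bytes, or -1 if the size is not a power of two or is out of range.
 */
static int biosize_for_bytes(unsigned long bytes, unsigned long page_size)
{
	int shift = 0;
	int min = (page_size == 4096) ? 13 : 14;	/* 8K only with 4K pages */

	if (bytes == 0 || (bytes & (bytes - 1)) != 0)
		return -1;	/* not a power of two */
	while ((1UL << shift) < bytes)
		shift++;
	if (shift < min || shift > 16)
		return -1;
	return shift;
}
```

For example, a 64K preferred I/O size corresponds to biosize=16.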
|
||||
|
||||
ikeep/noikeep
|
||||
When inode clusters are emptied of inodes, keep them around
|
||||
on the disk (ikeep) - this is the traditional XFS behaviour
|
||||
and is still the default for now. Using the noikeep option,
|
||||
inode clusters are returned to the free space pool.
|
||||
|
||||
logbufs=value
|
||||
Set the number of in-memory log buffers. Valid numbers range
|
||||
from 2 to 8 inclusive.
|
||||
The default value is 8 buffers for filesystems with a
|
||||
blocksize of 64K, 4 buffers for filesystems with a blocksize
|
||||
of 32K, 3 buffers for filesystems with a blocksize of 16K
|
||||
and 2 buffers for all other configurations. Increasing the
|
||||
number of buffers may increase performance on some workloads
|
||||
at the cost of the memory used for the additional log buffers
|
||||
and their associated control structures.
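
The defaults described above can be written as a small table. The
function names below are hypothetical and the code only restates the
documented values; it is not taken from the XFS sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Default number of log buffers for a given filesystem blocksize (bytes). */
static int default_logbufs(unsigned long blocksize)
{
	switch (blocksize) {
	case 65536:
		return 8;
	case 32768:
		return 4;
	case 16384:
		return 3;
	default:
		return 2;	/* all other configurations */
	}
}

/* logbufs= accepts values from 2 to 8 inclusive. */
static bool logbufs_valid(int value)
{
	return value >= 2 && value <= 8;
}
```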
|
||||
|
||||
logbsize=value
|
||||
Set the size of each in-memory log buffer.
|
||||
Size may be specified in bytes, or in kilobytes with a "k" suffix.
|
||||
Valid sizes for version 1 and version 2 logs are 16384 (16k) and
|
||||
32768 (32k). Valid sizes for version 2 logs also include
|
||||
65536 (64k), 131072 (128k) and 262144 (256k).
|
||||
The default value for machines with more than 32MB of memory
|
||||
is 32768; machines with less memory use 16384 by default.
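
As an illustration of the rules above, the following sketch checks a
logbsize value against the log version and picks the documented
default. Both function names are hypothetical, and memory is passed in
megabytes purely for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Is "size" (in bytes) an accepted logbsize for this log version? */
static bool logbsize_valid(unsigned long size, int log_version)
{
	switch (size) {
	case 16384:
	case 32768:
		return log_version == 1 || log_version == 2;
	case 65536:
	case 131072:
	case 262144:
		return log_version == 2;	/* larger buffers need v2 logs */
	default:
		return false;
	}
}

/* Documented default: 32768 above 32MB of memory, otherwise 16384. */
static unsigned long default_logbsize(unsigned long mem_mb)
{
	return mem_mb > 32 ? 32768UL : 16384UL;
}
```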
|
||||
|
||||
logdev=device and rtdev=device
|
||||
Use an external log (metadata journal) and/or real-time device.
|
||||
An XFS filesystem has up to three parts: a data section, a log
|
||||
section, and a real-time section. The real-time section is
|
||||
optional, and the log section can be separate from the data
|
||||
section or contained within it.
|
||||
|
||||
noalign
|
||||
Data allocations will not be aligned at stripe unit boundaries.
|
||||
|
||||
noatime
|
||||
Access timestamps are not updated when a file is read.
|
||||
|
||||
norecovery
|
||||
The filesystem will be mounted without running log recovery.
|
||||
If the filesystem was not cleanly unmounted, it is likely to
|
||||
be inconsistent when mounted in "norecovery" mode.
|
||||
Some files or directories may not be accessible because of this.
|
||||
Filesystems mounted "norecovery" must be mounted read-only or
|
||||
the mount will fail.
|
||||
|
||||
nouuid
|
||||
Don't check for double-mounted filesystems using the filesystem UUID.
|
||||
This is useful to mount LVM snapshot volumes.
|
||||
|
||||
osyncisosync
|
||||
Make O_SYNC writes implement true O_SYNC. WITHOUT this option,
|
||||
Linux XFS behaves as if an "osyncisdsync" option is used,
|
||||
which will make writes to files opened with the O_SYNC flag set
|
||||
behave as if the O_DSYNC flag had been used instead.
|
||||
This can result in better performance without compromising
|
||||
data safety.
|
||||
However, if this option is not in effect, timestamp updates from
|
||||
O_SYNC writes can be lost if the system crashes.
|
||||
If timestamp updates are critical, use the osyncisosync option.
|
||||
|
||||
quota/usrquota/uqnoenforce
|
||||
User disk quota accounting enabled, and limits (optionally)
|
||||
enforced.
|
||||
|
||||
grpquota/gqnoenforce
|
||||
Group disk quota accounting enabled and limits (optionally)
|
||||
enforced.
|
||||
|
||||
sunit=value and swidth=value
|
||||
Used to specify the stripe unit and width for a RAID device or
|
||||
a stripe volume. "value" must be specified in 512-byte block
|
||||
units.
|
||||
If this option is not specified and the filesystem was made on
|
||||
a stripe volume or the stripe width or unit were specified for
|
||||
the RAID device at mkfs time, then the mount system call will
|
||||
restore the value from the superblock. For filesystems that
|
||||
are made directly on RAID devices, these options can be used
|
||||
to override the information in the superblock if the underlying
|
||||
disk layout changes after the filesystem has been created.
|
||||
The "swidth" option is required if the "sunit" option has been
|
||||
specified, and must be a multiple of the "sunit" value.
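
Since both values are given in 512-byte units, a stripe geometry
expressed in bytes has to be converted first. The sketch below assumes
the usual convention that the stripe width is the stripe unit times
the number of data disks; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTOR_BYTES 512UL	/* sunit and swidth are in 512-byte units */

/* swidth must be a non-zero multiple of sunit. */
static bool stripe_opts_valid(unsigned long sunit, unsigned long swidth)
{
	return sunit != 0 && swidth != 0 && swidth % sunit == 0;
}

/*
 * Hypothetical helper: derive sunit/swidth from a stripe unit in bytes
 * and the number of data disks. Returns false if the stripe unit is not
 * a whole number of 512-byte units.
 */
static bool stripe_opts_from_geometry(unsigned long su_bytes,
				      unsigned int data_disks,
				      unsigned long *sunit,
				      unsigned long *swidth)
{
	if (su_bytes == 0 || su_bytes % SECTOR_BYTES != 0 || data_disks == 0)
		return false;
	*sunit = su_bytes / SECTOR_BYTES;
	*swidth = *sunit * data_disks;
	return stripe_opts_valid(*sunit, *swidth);
}
```

For example, a 64K stripe unit across 4 data disks gives sunit=128
and swidth=512.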
|
||||
|
||||
sysctls
|
||||
=======
|
||||
|
||||
The following sysctls are available for the XFS filesystem:
|
||||
|
||||
fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
|
||||
Setting this to "1" clears accumulated XFS statistics
|
||||
in /proc/fs/xfs/stat. It then immediately resets to "0".
|
||||
|
||||
fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
|
||||
The interval at which the xfssyncd thread flushes metadata
|
||||
out to disk. This thread will flush log activity out, and
|
||||
do some processing on unlinked inodes.
|
||||
|
||||
fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000)
|
||||
The interval at which xfsbufd scans the dirty metadata buffers list.
|
||||
|
||||
fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000)
|
||||
The age at which xfsbufd flushes dirty metadata buffers to disk.
|
||||
|
||||
fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
|
||||
A volume knob for error reporting when internal errors occur.
|
||||
This will generate detailed messages & backtraces for filesystem
|
||||
shutdowns, for example. Current threshold values are:
|
||||
|
||||
XFS_ERRLEVEL_OFF: 0
|
||||
XFS_ERRLEVEL_LOW: 1
|
||||
XFS_ERRLEVEL_HIGH: 5
|
||||
|
||||
fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
|
||||
Causes certain error conditions to call BUG(). Value is a bitmask;
|
||||
OR together the tags which represent errors which should cause panics:
|
||||
|
||||
XFS_NO_PTAG 0
|
||||
XFS_PTAG_IFLUSH 0x00000001
|
||||
XFS_PTAG_LOGRES 0x00000002
|
||||
XFS_PTAG_AILDELETE 0x00000004
|
||||
XFS_PTAG_ERROR_REPORT 0x00000008
|
||||
XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010
|
||||
XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
|
||||
XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
|
||||
|
||||
This option is intended for debugging only.
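
A panic_mask value is built by bitwise OR-ing the tag values listed
above. The sketch below copies those constants; panic_mask_build()
and panic_mask_valid() are hypothetical helpers for illustration,
not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Tag values as listed in the table above. */
#define XFS_PTAG_IFLUSH			0x00000001
#define XFS_PTAG_LOGRES			0x00000002
#define XFS_PTAG_AILDELETE		0x00000004
#define XFS_PTAG_ERROR_REPORT		0x00000008
#define XFS_PTAG_SHUTDOWN_CORRUPT	0x00000010
#define XFS_PTAG_SHUTDOWN_IOERROR	0x00000020
#define XFS_PTAG_SHUTDOWN_LOGERROR	0x00000040

/* Combine the selected tags into one mask with bitwise OR. */
static unsigned int panic_mask_build(const unsigned int *tags, int ntags)
{
	unsigned int mask = 0;
	int i;

	for (i = 0; i < ntags; i++)
		mask |= tags[i];
	return mask;
}

/* The sysctl accepts values from 0 to 127. */
static bool panic_mask_valid(unsigned int mask)
{
	return mask <= 127;
}
```

Selecting XFS_PTAG_SHUTDOWN_CORRUPT and XFS_PTAG_SHUTDOWN_IOERROR,
for instance, gives 0x30, which is 48 in decimal.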
|
||||
|
||||
fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
|
||||
Controls whether symlinks are created with mode 0777 (default)
|
||||
or whether their mode is affected by the umask (irix mode).
|
||||
|
||||
fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
|
||||
Controls files created in SGID directories.
|
||||
If the group ID of the new file does not match the effective group
|
||||
ID or one of the supplementary group IDs of the parent dir, the
|
||||
ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
|
||||
is set.
|
||||
|
||||
fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1)
|
||||
Controls whether unprivileged users can use chown to "give away"
|
||||
a file to another user.
|
||||
|
||||
fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "sync" flag set
|
||||
by the chattr(1) command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "nodump" flag set
|
||||
by the chattr(1) command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "noatime" flag set
|
||||
by the chattr(1) command on a directory to be
|
||||
inherited by files in that directory.
|