Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
commit 1da177e4c3
Author: Linus Torvalds
Date:   2005-04-16 15:20:36 -07:00

17291 changed files with 6718755 additions and 0 deletions


@@ -0,0 +1,50 @@
00-INDEX
- this file (info on some of the filesystems supported by linux).
Locking
- info on locking rules as they pertain to Linux VFS.
adfs.txt
- info and mount options for the Acorn Advanced Disc Filing System.
affs.txt
- info and mount options for the Amiga Fast File System.
bfs.txt
- info for the SCO UnixWare Boot Filesystem (BFS).
cifs.txt
- description of the CIFS filesystem
coda.txt
- description of the CODA filesystem.
cramfs.txt
- info on the cram filesystem for small storage (ROMs etc)
devfs/
- directory containing devfs documentation.
ext2.txt
- info, mount options and specifications for the Ext2 filesystem.
fat_cvf.txt
- info on the Compressed Volume Files extension to the FAT filesystem
hpfs.txt
- info and mount options for the OS/2 HPFS.
isofs.txt
- info and mount options for the ISO 9660 (CDROM) filesystem.
jfs.txt
- info and mount options for the JFS filesystem.
ncpfs.txt
- info on Novell Netware(tm) filesystem using NCP protocol.
ntfs.txt
- info and mount options for the NTFS filesystem (Windows NT).
proc.txt
- info on Linux's /proc filesystem.
romfs.txt
- Description of the ROMFS filesystem.
smbfs.txt
- info on using filesystems with the SMB protocol (Windows 3.11 and NT)
sysv-fs.txt
- info on the SystemV/V7/Xenix/Coherent filesystem.
udf.txt
- info and mount options for the UDF filesystem.
ufs.txt
- info on the ufs filesystem.
vfat.txt
- info on using the VFAT filesystem used in Windows NT and Windows 95
vfs.txt
- Overview of the Virtual File System
xfs.txt
- info and mount options for the XFS filesystem.


@@ -0,0 +1,176 @@
Making Filesystems Exportable
=============================
Most filesystem operations require a dentry (or two) as a starting
point. Local applications have a reference-counted hold on suitable
dentrys via open file descriptors or cwd/root. However remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry. As the alternative
form of reference needs to be stable across renames, truncates, and
server-reboot (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.
The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.
A filesystem which supports the mapping between filehandle fragments
and dentrys will be termed "exportable".
Dcache Issues
-------------
The dcache normally contains a proper prefix of any given filesystem
tree. This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).
However when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.
1/ The dcache must sometimes contain objects that are not part of the
proper prefix, i.e. that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory
to already have a (non-connected) dentry, and must be able to move
that dentry into place (based on the parent and name in the
->lookup). This is particularly needed for directories as
it is a dcache invariant that directories only have one dentry.
To implement these features, the dcache has:
a/ A dentry flag DCACHE_DISCONNECTED which is set on
any dentry that might not be part of the proper prefix.
This is set when anonymous dentries are created, and cleared when a
dentry is noticed to be a child of a dentry which is in the proper
prefix.
b/ A per-superblock list "s_anon" of dentries which are the roots of
subtrees that are not in the proper prefix. These dentries, as
well as the proper prefix, need to be released at unmount time. As
these dentries will not be hashed, they are linked together on the
d_hash list_head.
c/ Helper routines to allocate anonymous dentries, and to help attach
loose directory dentries at lookup time. They are:
d_alloc_anon(inode) will return a dentry for the given inode.
If the inode already has a dentry, one of those is returned.
If it doesn't, a new anonymous (IS_ROOT and
DCACHE_DISCONNECTED) dentry is allocated and attached.
In the case of a directory, care is taken that only one dentry
can ever be attached.
d_splice_alias(inode, dentry) will make sure that there is a
dentry with the same name and parent as the given dentry, and
which refers to the given inode.
If the inode is a directory and already has a dentry, then that
dentry is d_moved over the given dentry.
If the passed dentry gets attached, care is taken that this is
mutually exclusive to a d_alloc_anon operation.
If the passed dentry is used, NULL is returned, else the used
dentry is returned. This corresponds to the calling pattern of
->lookup.
Filesystem Issues
-----------------
For a filesystem to be exportable it must:
1/ provide the filehandle fragment routines described below.
2/ make sure that d_splice_alias is used rather than d_add
when ->lookup finds an inode for a given parent and name.
Typically the ->lookup routine will end:
	if (inode)
		return d_splice_alias(inode, dentry);
	d_add(dentry, inode);
	return NULL;
}
A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which could potentially be full of NULLs, though normally at
least get_parent will be set.
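For example, a filesystem might declare itself exportable along the
following lines (a minimal sketch only; "foofs" and its helpers are
purely illustrative names, not real kernel code):

static struct export_operations foofs_export_ops = {
	.get_parent = foofs_get_parent,	/* sketched further down */
	/* decode_fh, encode_fh etc. left NULL so that the ext2-style
	 * defaults described later in this file are used */
};

static int foofs_fill_super(struct super_block *sb, void *data, int silent)
{
	/* ... usual superblock setup ... */
	sb->s_export_op = &foofs_export_ops;
	return 0;
}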
The primary operations are decode_fh and encode_fh.
decode_fh takes a filehandle fragment and tries to find or create a
dentry for the object referred to by the filehandle.
encode_fh takes a dentry and creates a filehandle fragment which can
later be used to find/create a dentry for the same object.
decode_fh will probably make use of "find_exported_dentry".
This function lives in the "exportfs" module which a filesystem does
not need unless it is being exported. So rather than calling
find_exported_dentry directly, each filesystem should call it through
the find_exported_dentry pointer in its export_operations table.
This field is set correctly by the exporting agent (e.g. nfsd) when a
filesystem is exported, and before any export operations are called.
find_exported_dentry needs three support functions from the
filesystem:
get_name. When given a parent dentry and a child dentry, this
should find a name in the directory identified by the parent
dentry, which leads to the object identified by the child dentry.
If no get_name function is supplied, a default implementation is
provided which uses vfs_readdir to find potential names, and
matches inode numbers to find the correct match.
get_parent. When given a dentry for a directory, this should return
a dentry for the parent. Quite possibly the parent dentry will
have been allocated by d_alloc_anon.
The default get_parent function just returns an error so any
filehandle lookup that requires finding a parent will fail.
->lookup("..") is *not* used as a default as it can leave ".."
entries in the dcache which are too messy to work with.
get_dentry. When given an opaque datum, this should find the
implied object and create a dentry for it (possibly with
d_alloc_anon).
The opaque datum is whatever is passed down by the decode_fh
function, and is often simply a fragment of the filehandle
fragment.
decode_fh passes two datums through find_exported_dentry. One that
should be used to identify the target object, and one that can be
used to identify the object's parent, should that be necessary.
The default get_dentry function assumes that the datum contains an
inode number and a generation number, and it attempts to get the
inode using "iget" and check it's validity by matching the
generation number. A filesystem should only depend on the default
if iget can safely be used this way.
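For illustration, a get_parent in the spirit of the above might look
roughly like this (a sketch only: foofs_dotdot_ino() is a hypothetical
helper, and the iget()-based lookup is just one possible approach):

static struct dentry *foofs_get_parent(struct dentry *child)
{
	unsigned long ino;
	struct inode *inode;
	struct dentry *parent;

	ino = foofs_dotdot_ino(child->d_inode);	/* find the ".." inode */
	if (!ino)
		return ERR_PTR(-ENOENT);

	inode = iget(child->d_inode->i_sb, ino);
	if (!inode)
		return ERR_PTR(-EACCES);

	/* wrap it in an anonymous (possibly disconnected) dentry */
	parent = d_alloc_anon(inode);
	if (!parent) {
		iput(inode);
		parent = ERR_PTR(-ENOMEM);
	}
	return parent;
}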
If decode_fh and/or encode_fh are left as NULL, then default
implementations are used. These defaults are suitable for ext2 and
extremely similar filesystems (like ext3).
The default encode_fh creates a filehandle fragment from the inode
number and generation number of the target together with the inode
number and generation number of the parent (if the parent is
required).
The default decode_fh extracts the target and parent datums from the
filehandle assuming the format used by the default encode_fh and
passes them to find_exported_dentry.
A filehandle fragment consists of an array of one or more 4-byte words,
together with a one byte "type".
The decode_fh routine should not depend on the stated size that is
passed to it. This size may be larger than the original filehandle
generated by encode_fh, in which case it will have been padded with
nuls. Rather, the encode_fh routine should choose a "type" which
indicates to decode_fh how much of the filehandle is valid, and how
it should be interpreted.
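As a rough illustration of such a fragment, an encode_fh that stores
just the inode and generation numbers (much like the default described
above) might look like this; the prototype is assumed from the 2.6-era
export_operations and the names are illustrative:

static int foofs_encode_fh(struct dentry *dentry, __u32 *fh, int *max_len,
			   int connectable)
{
	struct inode *inode = dentry->d_inode;

	if (*max_len < 2)
		return 255;		/* fragment does not fit */

	fh[0] = inode->i_ino;
	fh[1] = inode->i_generation;
	*max_len = 2;
	return 1;			/* "type" 1: inode + generation only */
}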


@@ -0,0 +1,515 @@
The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant
instances in the tree, don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases in the end of this file.
Don't turn it into log - maintainers of out-of-the-tree code are supposed to
be able to use diff(1).
Thing currently missing here: socket operations. Alexey?
--------------------------- dentry_operations --------------------------
prototypes:
int (*d_revalidate)(struct dentry *, int);
int (*d_hash) (struct dentry *, struct qstr *);
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
locking rules:
none have BKL
dcache_lock rename_lock ->d_lock may block
d_revalidate: no no no yes
d_hash no no no yes
d_compare: no yes no no
d_delete: yes no yes no
d_release: no no no yes
d_iput: no no no yes
--------------------------- inode_operations ---------------------------
prototypes:
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
int (*follow_link) (struct dentry *, struct nameidata *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int, struct nameidata *);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
locking rules:
all may block, none have BKL
i_sem(inode)
lookup: yes
create: yes
link: yes (both)
mknod: yes
symlink: yes
mkdir: yes
unlink: yes (both)
rmdir: yes (both) (see below)
rename: yes (all) (see below)
readlink: no
follow_link: no
truncate: yes (see below)
setattr: yes
permission: no
getattr: no
setxattr: yes
getxattr: no
listxattr: no
removexattr: yes
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
->truncate() is never called directly - it's a callback, not a
method. It's called by vmtruncate() - library function normally used by
->setattr(). Locking information above applies to that call (i.e. is
inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
passed).
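A sketch of that chain, assuming the usual library helpers of this era
(the names are illustrative, not taken from any particular filesystem):

static int foofs_setattr(struct dentry *dentry, struct iattr *attr)
{
	struct inode *inode = dentry->d_inode;
	int error;

	error = inode_change_ok(inode, attr);
	if (error)
		return error;

	/* inode_setattr() calls vmtruncate() when ATTR_SIZE is set,
	 * which in turn invokes ->truncate(); i_sem is held throughout
	 * because the caller took it before ->setattr() */
	return inode_setattr(inode, attr);
}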
See Documentation/filesystems/directory-locking for more detailed discussion
of the locking scheme for directory operations.
--------------------------- super_operations ---------------------------
prototypes:
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*read_inode) (struct inode *);
void (*dirty_inode) (struct inode *);
int (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
void (*write_super_lockfs) (struct super_block *);
void (*unlockfs) (struct super_block *);
int (*statfs) (struct super_block *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct vfsmount *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
locking rules:
All may block.
BKL s_lock s_umount
alloc_inode: no no no
destroy_inode: no
read_inode: no (see below)
dirty_inode: no (must not sleep)
write_inode: no
put_inode: no
drop_inode: no !!!inode_lock!!!
delete_inode: no
put_super: yes yes no
write_super: no yes read
sync_fs: no no read
write_super_lockfs: ?
unlockfs: ?
statfs: no no no
remount_fs: no yes maybe (see below)
clear_inode: no
umount_begin: yes no no
show_options: no (vfsmount->sem)
quota_read: no no no (see below)
quota_write: no no no (see below)
->read_inode() is not a method - it's a callback used in iget().
->remount_fs() will have the s_umount lock if it's already mounted.
When called from get_sb_single, it does NOT have the s_umount lock.
->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also dquot_operations section.
--------------------------- file_system_type ---------------------------
prototypes:
struct super_block *(*get_sb) (struct file_system_type *, int,
const char *, void *);
void (*kill_sb) (struct super_block *);
locking rules:
may block BKL
get_sb yes yes
kill_sb yes yes
->get_sb() returns error or a locked superblock (exclusive on ->s_umount).
->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.
--------------------------- address_space_operations --------------------------
prototypes:
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
int (*sync_page)(struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
locking rules:
All except set_page_dirty may block
BKL PageLocked(page)
writepage: no yes, unlocks (see below)
readpage: no yes, unlocks
sync_page: no maybe
writepages: no
set_page_dirty no no
readpages: no
prepare_write: no yes
commit_write: no yes
bmap: yes
invalidatepage: no yes
releasepage: no yes
direct_IO: no
->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
may be called from the request handler (/dev/loop).
->readpage() unlocks the page, either synchronously or via I/O
completion.
->readpages() populates the pagecache with the passed pages and starts
I/O against them. They come unlocked upon I/O completion.
->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ
depending upon the mode.
If writepage is called for sync (wbc->sync_mode != WB_SYNC_NONE) then
it *must* start I/O against the page, even if that would involve
blocking on in-progress I/O.
If writepage is called for memory cleansing (sync_mode ==
WB_SYNC_NONE) then its role is to get as much writeout underway as
possible. So writepage should try to avoid blocking against
currently-in-progress I/O.
If the filesystem is not called for "sync" and it determines that it
would need to block against in-progress I/O to be able to start new I/O
against the page the filesystem should redirty the page with
redirty_page_for_writepage(), then unlock the page and return zero.
This may also be done to avoid internal deadlocks, but rarely.
If the filesystem is called for sync then it must wait on any
in-progress I/O and then start new I/O.
The filesystem should unlock the page synchronously, before returning
to the caller.
Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
followed by unlocking it. Once set_page_writeback() has been run against the
page, write I/O can be submitted and the write I/O completion handler must run
end_page_writeback() once the I/O is complete. If no I/O is submitted, the
filesystem must run end_page_writeback() against the page before returning from
writepage.
That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
if the filesystem needs the page to be locked during writeout, that is ok, too,
the page is allowed to be unlocked at any point in time between the calls to
set_page_writeback() and end_page_writeback().
Note, failure to run either redirty_page_for_writepage() or the combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty in the
radix tree. This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data.
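A skeleton that follows these rules might look like this (the foofs_*()
helpers are hypothetical; only the locking and writeback calls are the
point of the sketch, and for WB_SYNC_ALL the submission helper is
assumed to wait on any I/O already in flight):

static int foofs_writepage(struct page *page, struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_NONE && foofs_would_block(page)) {
		/* memory cleansing: don't block on in-progress I/O,
		 * just hand the page back */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	set_page_writeback(page);
	unlock_page(page);	/* pages under writeout are not locked */

	if (foofs_submit_io(page) < 0) {
		/* no I/O was submitted, so end writeback ourselves */
		end_page_writeback(page);
		return -EIO;
	}
	/* otherwise the I/O completion handler runs end_page_writeback() */
	return 0;
}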
->sync_page() locking rules are not well-defined - usually it is called
with lock on page, but that is not guaranteed. Considering the currently
existing instances of this method ->sync_page() itself doesn't look
well-defined...
->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
*nr_to_write pages. *nr_to_write must be decremented for each page which is
written. The address_space implementation may write more (or less) pages
than *nr_to_write asks for, but it should try to be reasonably close. If
nr_to_write is NULL, all dirty pages must be written.
writepages should _only_ write pages which are present on
mapping->io_pages.
->set_page_dirty() is called from various places in the kernel
when the target page is marked as needing writeback. It may be called
under spinlock (it cannot block) and is sometimes called with the page
not locked.
->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. All
instances do not actually need the BKL. Please, keep it that way and don't
breed new callers.
->invalidatepage() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
returns zero on success. If ->invalidatepage is zero, the kernel uses
block_invalidatepage() instead.
->releasepage() is called when the kernel is about to try to drop the
buffers from the page in preparation for freeing it. It returns zero to
indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
the kernel assumes that the fs has no private interest in the buffers.
Note: currently almost all instances of address_space methods are
using BKL for internal serialization and that's one of the worst sources
of contention. Normally they are calling library functions (in fs/buffer.c)
and pass foo_get_block() as a callback (on local block-based filesystems,
indeed). BKL is not needed for library stuff and is usually taken by
foo_get_block(). It's an overkill, since block bitmaps can be protected by
internal fs locking and real critical areas are much smaller than the areas
filesystems protect now.
----------------------- file_lock_operations ------------------------------
prototypes:
void (*fl_insert)(struct file_lock *); /* lock insertion callback */
void (*fl_remove)(struct file_lock *); /* lock removal callback */
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
void (*fl_release_private)(struct file_lock *);
locking rules:
BKL may block
fl_insert: yes no
fl_remove: yes no
fl_copy_lock: yes no
fl_release_private: yes yes
----------------------- lock_manager_operations ---------------------------
prototypes:
int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
void (*fl_notify)(struct file_lock *); /* unblock callback */
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
void (*fl_release_private)(struct file_lock *);
void (*fl_break)(struct file_lock *); /* break_lease callback */
locking rules:
BKL may block
fl_compare_owner: yes no
fl_notify: yes no
fl_copy_lock: yes no
fl_release_private: yes yes
fl_break: yes no
Currently only NFSD and NLM provide instances of this class. None of
them block. If you have out-of-tree instances - please, show up. Locking
in that area will change.
--------------------------- buffer_head -----------------------------------
prototypes:
void (*b_end_io)(struct buffer_head *bh, int uptodate);
locking rules:
called from interrupts. In other words, extreme care is needed here.
bh is locked, but that's all warranties we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
call this method upon the IO completion.
--------------------------- block_device_operations -----------------------
prototypes:
int (*open) (struct inode *, struct file *);
int (*release) (struct inode *, struct file *);
int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
int (*media_changed) (struct gendisk *);
int (*revalidate_disk) (struct gendisk *);
locking rules:
BKL bd_sem
open: yes yes
release: yes yes
ioctl: yes no
media_changed: no no
revalidate_disk: no no
The last two are called only from check_disk_change().
--------------------------- file_operations -------------------------------
prototypes:
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t,
loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int,
unsigned long);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
loff_t *);
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
void __user *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*dir_notify)(struct file *, unsigned long);
};
locking rules:
All except ->poll() may block.
BKL
llseek: no (see below)
read: no
aio_read: no
write: no
aio_write: no
readdir: no
poll: no
ioctl: yes (see below)
unlocked_ioctl: no (see below)
compat_ioctl: no
mmap: no
open: maybe (see below)
flush: no
release: no
fsync: no (see below)
aio_fsync: no
fasync: yes (see below)
lock: yes
readv: no
writev: no
sendfile: no
sendpage: no
get_unmapped_area: no
check_flags: no
dir_notify: no
->llseek() locking has moved from llseek to the individual llseek
implementations. If your fs is not using generic_file_llseek, you
need to acquire and release the appropriate locks in your ->llseek().
For many filesystems, it is probably safe to acquire the inode
semaphore. Note some filesystems (i.e. remote ones) provide no
protection for i_size so you will need to use the BKL.
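For instance, an ->llseek() that serialises against i_size updates with
the inode semaphore might be sketched as follows (illustrative only):

static loff_t foofs_llseek(struct file *file, loff_t offset, int origin)
{
	struct inode *inode = file->f_dentry->d_inode;
	loff_t ret = -EINVAL;

	down(&inode->i_sem);
	switch (origin) {
	case 2:			/* SEEK_END */
		offset += i_size_read(inode);
		break;
	case 1:			/* SEEK_CUR */
		offset += file->f_pos;
		break;
	}
	if (offset >= 0) {
		file->f_pos = offset;
		ret = offset;
	}
	up(&inode->i_sem);
	return ret;
}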
->open() locking is in-transit: big lock partially moved into the methods.
The only exception is ->open() in the instances of file_operations that never
end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices
(chrdev_open() takes lock before replacing ->f_op and calling the secondary
method). As soon as we fix the handling of module reference counters all
instances of ->open() will be called without the BKL.
Note: ext2_release() was *the* source of contention on fs-intensive
loads and dropping BKL on ->release() helps to get rid of that (we still
grab BKL for cases when we close a file that had been opened r/w, but that
can and should be done using the internal locking with smaller critical areas).
Current worst offender is ext2_get_block()...
->fasync() is a mess. This area needs a big cleanup and that will probably
affect locking.
->readdir() and ->ioctl() on directories must be changed. Ideally we would
move ->readdir() to inode_operations and use a separate method for directory
->ioctl() or kill the latter completely. One of the problems is that for
anything that resembles union-mount we won't have a struct file for all
components. And there are other reasons why the current interface is a mess...
->ioctl() on regular files is superseded by the ->unlocked_ioctl() that
doesn't take the BKL.
->read on directories probably must go away - we should just enforce -EISDIR
in sys_read() and friends.
->fsync() has i_sem on inode.
--------------------------- dquot_operations -------------------------------
prototypes:
int (*initialize) (struct inode *, int);
int (*drop) (struct inode *);
int (*alloc_space) (struct inode *, qsize_t, int);
int (*alloc_inode) (const struct inode *, unsigned long);
int (*free_space) (struct inode *, qsize_t);
int (*free_inode) (const struct inode *, unsigned long);
int (*transfer) (struct inode *, struct iattr *);
int (*write_dquot) (struct dquot *);
int (*acquire_dquot) (struct dquot *);
int (*release_dquot) (struct dquot *);
int (*mark_dirty) (struct dquot *);
int (*write_info) (struct super_block *, int);
These operations are intended to be more or less wrapping functions that ensure
proper locking with respect to the filesystem and call the generic quota operations.
What filesystem should expect from the generic quota functions:
FS recursion Held locks when called
initialize: yes maybe dqonoff_sem
drop: yes -
alloc_space: ->mark_dirty() -
alloc_inode: ->mark_dirty() -
free_space: ->mark_dirty() -
free_inode: ->mark_dirty() -
transfer: yes -
write_dquot: yes dqonoff_sem or dqptr_sem
acquire_dquot: yes dqonoff_sem or dqptr_sem
release_dquot: yes dqonoff_sem or dqptr_sem
mark_dirty: no -
write_info: yes dqonoff_sem
FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations.
->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
only directly by the filesystem and do not call any filesystem functions
other than the ->mark_dirty() operation.
More details about quota locking can be found in fs/dquot.c.
--------------------------- vm_operations_struct -----------------------------
prototypes:
void (*open)(struct vm_area_struct*);
void (*close)(struct vm_area_struct*);
struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
locking rules:
BKL mmap_sem
open: no yes
close: no yes
nopage: no yes
================================================================================
Dubious stuff
(if you break something or notice that it is broken and do not fix it yourself
- at least put it here)
ipc/shm.c::shm_delete() - may need BKL.
->read() and ->write() in many drivers are (probably) missing BKL.
drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.


@@ -0,0 +1,57 @@
Mount options for ADFS
----------------------
uid=nnn All files in the partition will be owned by
user id nnn. Default 0 (root).
gid=nnn All files in the partition will be in group
nnn. Default 0 (root).
ownmask=nnn The permission mask for ADFS 'owner' permissions
will be nnn. Default 0700.
othmask=nnn The permission mask for ADFS 'other' permissions
will be nnn. Default 0077.
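For example, a typical mount using these options might look like this
(device and mount point are illustrative):

mount -t adfs -o uid=500,gid=100,ownmask=0700,othmask=0077 /dev/hdb1 /mnt/adfs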
Mapping of ADFS permissions to Linux permissions
------------------------------------------------
ADFS permissions consist of the following:
Owner read
Owner write
Other read
Other write
(In older versions, an 'execute' permission did exist, but this
does not hold the same meaning as the Linux 'execute' permission
and is now obsolete).
The mapping is performed as follows:
Owner read -> -r--r--r--
Owner write -> --w--w---w
Owner read and filetype UnixExec -> ---x--x--x
These are then masked by ownmask, eg 700 -> -rwx------
Possible owner mode permissions -> -rwx------
Other read -> -r--r--r--
Other write -> --w--w--w-
Other read and filetype UnixExec -> ---x--x--x
These are then masked by othmask, eg 077 -> ----rwxrwx
Possible other mode permissions -> ----rwxrwx
Hence, with the default masks, if a file is owner read/write, and
not a UnixExec filetype, then the permissions will be:
-rw-------
However, if the masks were ownmask=0770,othmask=0007, then this would
be modified to:
-rw-rw----
There is no restriction on what you can do with these masks. You may
wish, for example, that either read bit gives read access to the file
for all, while keeping the default write protection (ownmask=0755,othmask=0577):
-rw-r--r--
You can therefore tailor the permission translation to whatever you
want the permissions to be under Linux.


@@ -0,0 +1,219 @@
Overview of Amiga Filesystems
=============================
Not all varieties of the Amiga filesystems are supported for reading and
writing. The Amiga currently knows six different filesystems:
DOS\0 The old or original filesystem, not really suited for
hard disks and normally not used on them, either.
Supported read/write.
DOS\1 The original Fast File System. Supported read/write.
DOS\2 The old "international" filesystem. International means that
a bug has been fixed so that accented ("international") letters
in file names are case-insensitive, as they ought to be.
Supported read/write.
DOS\3 The "international" Fast File System. Supported read/write.
DOS\4 The original filesystem with directory cache. The directory
cache speeds up directory accesses on floppies considerably,
but slows down file creation/deletion. Doesn't make much
sense on hard disks. Supported read only.
DOS\5 The Fast File System with directory cache. Supported read only.
All of the above filesystems allow block sizes from 512 to 32K bytes.
Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
speed up almost everything at the expense of wasted disk space. The speed
gain above 4K seems not really worth the price, so you don't lose too
much here, either.
The muFS (multi user File System) equivalents of the above file systems
are supported, too.
Mount options for the AFFS
==========================
protect If this option is set, the protection bits cannot be altered.
setuid[=uid] This sets the owner of all files and directories in the file
system to uid or the uid of the current user, respectively.
setgid[=gid] Same as above, but for gid.
mode=mode Sets the mode flags to the given (octal) value, regardless
of the original permissions. Directories will get an x
permission if the corresponding r bit is set.
This is useful since most of the plain AmigaOS files
will map to 600.
reserved=num Sets the number of reserved blocks at the start of the
partition to num. You should never need this option.
Default is 2.
root=block Sets the block number of the root block. This should never
be necessary.
bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
1024, 2048 and 4096. Like the root option, this should
never be necessary, as the affs can figure it out itself.
quiet The file system will not return an error for disallowed
mode changes.
verbose The volume name, file system type and block size will
be written to the syslog when the filesystem is mounted.
mufs The filesystem is really a muFS, although it doesn't
identify itself as one. This option is necessary if
the filesystem wasn't formatted as muFS, but is used
as one.
prefix=path Path will be prefixed to every absolute path name of
symbolic links on an AFFS partition. Default = "/".
(See below.)
volume=name When symbolic links with an absolute path are created
on an AFFS partition, name will be prepended as the
volume name. Default = "" (empty string).
(See below.)
Handling of the Users/Groups and protection flags
=================================================
Amiga -> Linux:
The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
- R maps to r for user, group and others. On directories, R implies x.
- If both W and D are allowed, w will be set.
- E maps to x.
- H and P are always retained and ignored under Linux.
- A is always reset when a file is written to.
User id and group id will be used unless set[gu]id are given as mount
options. Since most of the Amiga file systems are single user systems
they will be owned by root. The root directory (the mount point) of the
Amiga filesystem will be owned by the user who actually mounts the
filesystem (the root directory doesn't have uid/gid fields).
Linux -> Amiga:
The Linux rwxrwxrwx file mode is handled as follows:
- r permission will set R for user, group and others.
- w permission will set W and D for user, group and others.
- x permission of the user will set E for plain files.
- All other flags (suid, sgid, ...) are ignored and will
not be retained.
Newly created files and directories will get the user and group ID
of the current user and a mode according to the umask.
Symbolic links
==============
Although the Amiga and Linux file systems resemble each other, there
are some, not always subtle, differences. One of them becomes apparent
with symbolic links. While Linux has a file system with exactly one
root directory, the Amiga has a separate root directory for each
file system (for example, partition, floppy disk, ...). With the Amiga,
these entities are called "volumes". They have symbolic names which
can be used to access them. Thus, symbolic links can point to a
different volume. AFFS turns the volume name into a directory name
and prepends the prefix path (see prefix option) to it.
Example:
You mount all your Amiga partitions under /amiga/<volume> (where
<volume> is the name of the volume), and you give the option
"prefix=/amiga/" when mounting all your AFFS partitions. (They
might be "User", "WB" and "Graphics", the mount points /amiga/User,
/amiga/WB and /amiga/Graphics). A symbolic link referring to
"User:sc/include/dos/dos.h" will be followed to
"/amiga/User/sc/include/dos/dos.h".
Examples
========
Command line:
mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
mount /dev/sda3 /Amiga -t affs
/etc/fstab entry:
/dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
IMPORTANT NOTE
==============
If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
have an Amiga harddisk connected to your PC, it will overwrite
the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
the Rigid Disk Block. Sheer luck has it that this is an unused
area of the RDB, so only the checksum doesn't match anymore.
Linux will ignore this garbage and recognize the RDB anyway, but
before you connect that drive to your Amiga again, you must
restore or repair your RDB. So please do make a backup copy of it
before booting Windows!
If the damage is already done, the following should fix the RDB
(where <disk> is the device name).
DO AT YOUR OWN RISK:
dd if=/dev/<disk> of=rdb.tmp count=1
cp rdb.tmp rdb.fixed
dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4
dd if=rdb.fixed of=/dev/<disk>
Bugs, Restrictions, Caveats
===========================
Quite a few things may not work as advertised. Not everything is
tested, though several hundred MB have been read and written using
this fs. For a most up-to-date list of bugs please consult
fs/affs/Changes.
Filenames are truncated to 30 characters without warning (this
can be changed by setting the compile-time option AFFS_NO_TRUNCATE
in include/linux/amigaffs.h).
Case is ignored by the affs in filename matching, but Linux shells
do care about the case. Example (with /wb being an affs mounted fs):
rm /wb/WRONGCASE
will remove /mnt/wrongcase, but
rm /wb/WR*
will not since the names are matched by the shell.
The block allocation is designed for hard disk partitions. If more
than 1 process writes to a (small) diskette, the blocks are allocated
in an ugly way (but the real AFFS doesn't do much better). This
is also true when space gets tight.
You cannot execute programs on an OFS (Old File System), since the
program files cannot be memory mapped due to the 488 byte blocks.
For the same reason you cannot mount an image on such a filesystem
via the loopback device.
The bitmap valid flag in the root block may not be accurate when the
system crashes while an affs partition is mounted. There's currently
no way to fix a garbled filesystem without an Amiga (disk validator)
or manually (who would do this?). Maybe later.
If you mount affs partitions on system startup, you may want to tell
fsck that the fs should not be checked (place a '0' in the sixth field
of /etc/fstab).
It's not possible to read floppy disks with a normal PC or workstation
due to an incompatibility with the Amiga floppy controller.
If you are interested in an Amiga Emulator for Linux, look at
http://www-users.informatik.rwth-aachen.de/~crux/uae.html


@@ -0,0 +1,155 @@
kAFS: AFS FILESYSTEM
====================
ABOUT
=====
This filesystem provides a fairly simple AFS filesystem driver. It is under
development and only provides very basic facilities. It does not yet support
the following AFS features:
(*) Write support.
(*) Communications security.
(*) Local caching.
(*) pioctl() system call.
(*) Automatic mounting of embedded mountpoints.
USAGE
=====
When inserting the driver modules the root cell must be specified along with a
list of volume location server IP addresses:
insmod rxrpc.o
insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
The first module is a driver for the RxRPC remote operation protocol, and the
second is the actual filesystem driver for the AFS filesystem.
Once the module has been loaded, more cells can be added by the following
procedure:
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
Where the parameters to the "add" command are the name of a cell and a list of
volume location servers within that cell.
Filesystems can be mounted anywhere by commands similar to the following:
mount -t afs "%cambridge.redhat.com:root.afs." /afs
mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
mount -t afs "#root.afs." /afs
mount -t afs "#root.cell." /afs/cambridge
NB: When using this on Linux 2.4, the mount command has to be different,
since the filesystem doesn't have access to the device name argument:
mount -t afs none /afs -ovol="#root.afs."
Where the initial character is either a hash or a percent symbol depending on
whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O
volume, but are willing to use a R/W volume instead (percent).
The name of the volume can be suffixed with ".backup" or ".readonly" to
specify connection to only volumes of those types.
The name of the cell is optional, and if not given during a mount, then the
named volume will be looked up in the cell specified during insmod.
Additional cells can be added through /proc (see later section).
MOUNTPOINTS
===========
AFS has a concept of mountpoints. These are specially formatted symbolic links
(of the same form as the "device name" passed to mount). kAFS presents these
to the user as directories that have special properties:
(*) They cannot be listed. Running a program like "ls" on them will incur an
EREMOTE error (Object is remote).
(*) Other objects can't be looked up inside of them. This also incurs an
EREMOTE error.
(*) They can be queried with the readlink() system call, which will return
the name of the mountpoint to which they point. The "readlink" program
will also work.
(*) They can be mounted on (which symbolic links can't).
PROC FILESYSTEM
===============
The rxrpc module creates a number of files in various places in the /proc
filesystem:
(*) Firstly, some information files are made available in a directory called
"/proc/net/rxrpc/". These list the extant transport endpoint, peer,
connection and call records.
(*) Secondly, some control files are made available in a directory called
"/proc/sys/rxrpc/". Currently, all these files can be used for is to
turn on various levels of tracing.
The AFS module creates a "/proc/fs/afs/" directory and populates it:
(*) A "cells" file that lists cells currently known to the afs module.
(*) A directory per cell that contains files that list volume location
servers, volumes, and active servers known within that cell.
THE CELL DATABASE
=================
The filesystem maintains an internal database of all the cells it knows and
the IP addresses of the volume location servers for those cells. The cell to
which the computer belongs is added to the database when insmod is performed
by the "rootcell=" argument.
Further cells can be added by commands similar to the following:
echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
No other cell database operations are available at this time.
EXAMPLES
========
Here's what I use to test this. Some of the names and IP addresses are local
to my internal DNS. My "root.afs" partition has a mount point within it for
some public volumes.
insmod -S /tmp/rxrpc.o
insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
mount -t afs \%root.afs. /afs
mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells
mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
umount /afs/grand.central.org/user
umount /afs/grand.central.org/software
umount /afs/grand.central.org/service
umount /afs/grand.central.org/project
umount /afs/grand.central.org/doc
umount /afs/grand.central.org/contrib
umount /afs/grand.central.org/archive
umount /afs/grand.central.org
umount /afs/cambridge.redhat.com
umount /afs
rmmod kafs
rmmod rxrpc


@@ -0,0 +1,118 @@
Support is available for filesystems that wish to do automounting (such
as kAFS which can be found in fs/afs/). This facility includes allowing
in-kernel mounts to be performed and mountpoint degradation to be
requested. The latter can also be requested by userspace.
======================
IN-KERNEL AUTOMOUNTING
======================
A filesystem can now mount another filesystem on one of its directories by the
following procedure:
(1) Give the directory a follow_link() operation.
When the directory is accessed, the follow_link op will be called, and
it will be provided with the location of the mountpoint in the nameidata
structure (vfsmount and dentry).
(2) Have the follow_link() op do the following steps:
(a) Call do_kern_mount() to call the appropriate filesystem to set up a
superblock and gain a vfsmount structure representing it.
(b) Copy the nameidata provided as an argument and substitute the dentry
argument into the copy.
(c) Call do_add_mount() to install the new vfsmount into the namespace's
mountpoint tree, thus making it accessible to userspace. Use the
nameidata set up in (b) as the destination.
If the mountpoint will be automatically expired, then do_add_mount()
should also be given the location of an expiration list (see further
down).
(d) Release the path in the nameidata argument and substitute in the new
vfsmount and its root dentry. The ref counts on these will need
incrementing.
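A very rough sketch of steps (1) and (2) (error handling trimmed; the
signatures are assumed from the 2.6-era VFS and the "foofs" names are
purely illustrative):

static LIST_HEAD(foofs_auto_expiry_list);	/* see the expiry section below */

static int foofs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd)
{
	struct nameidata newnd;
	struct vfsmount *newmnt;
	int err;

	/* (a) build a superblock and vfsmount for the target filesystem */
	newmnt = do_kern_mount("foofs", 0, "fooserver:volume", NULL);
	if (IS_ERR(newmnt))
		return PTR_ERR(newmnt);

	/* (b) copy the nameidata and substitute the mountpoint dentry */
	newnd = *nd;
	newnd.dentry = dentry;

	/* (c) splice the new vfsmount into the namespace, putting it on
	 *     an expiration list at the same time */
	err = do_add_mount(newmnt, &newnd, 0, &foofs_auto_expiry_list);
	if (err < 0) {
		mntput(newmnt);
		return err;
	}

	/* (d) release the old path and continue the walk inside the new
	 *     mount, taking our own references on it */
	dput(nd->dentry);
	mntput(nd->mnt);
	nd->mnt = mntget(newmnt);
	nd->dentry = dget(newmnt->mnt_root);
	return 0;
}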
Then from userspace, you can just do something like:
[root@andromeda root]# mount -t afs \#root.afs. /afs
[root@andromeda root]# ls /afs
asd cambridge cambridge.redhat.com grand.central.org
[root@andromeda root]# ls /afs/cambridge
afsdoc
[root@andromeda root]# ls /afs/cambridge/afsdoc/
ChangeLog html LICENSE pdf RELNOTES-1.2.2
And then if you look in the mountpoint catalogue, you'll see something like:
[root@andromeda root]# cat /proc/mounts
...
#root.afs. /afs afs rw 0 0
#root.cell. /afs/cambridge.redhat.com afs rw 0 0
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
===========================
AUTOMATIC MOUNTPOINT EXPIRY
===========================
Automatic expiration of mountpoints is easy, provided you've mounted the
mountpoint to be expired in the automounting procedure outlined above.
To do expiration, you need to follow these steps:
(3) Create at least one list off which the vfsmounts to be expired can be
hung. Access to this list will be governed by the vfsmount_lock.
(4) In step (2c) above, the call to do_add_mount() should be provided with a
pointer to this list. It will hang the vfsmount off of it if it succeeds.
(5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
with a pointer to this list. This will process the list, marking every
vfsmount thereon for potential expiry on the next call.
If a vfsmount was already flagged for expiry, and if its usage count is 1
(it's only referenced by its parent vfsmount), then it will be deleted
from the namespace and thrown away (effectively unmounted).
It may prove simplest to simply call this at regular intervals, using
some sort of timed event to drive it.
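In sketch form, re-using the foofs_auto_expiry_list declared in the
earlier example (the timer plumbing is only indicative):

static void foofs_expiry_timed_out(unsigned long data)
{
	/* mark everything on the list; anything still marked and unused
	 * when this runs again will be unmounted */
	mark_mounts_for_expiry(&foofs_auto_expiry_list);

	/* ... re-arm the timer for the next interval ... */
}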
The expiration flag is cleared by calls to mntput. This means that expiration
will only happen on the second expiration request after the last time the
mountpoint was accessed.
If a mountpoint is moved, it gets removed from the expiration list. If a bind
mount is made on an expirable mount, the new vfsmount will not be on the
expiration list and will not expire.
If a namespace is copied, all mountpoints contained therein will be copied,
and the copies of those that are on an expiration list will be added to the
same expiration list.
=======================
USERSPACE DRIVEN EXPIRY
=======================
As an alternative, it is possible for userspace to request expiry of any
mountpoint (though some will be rejected - the current process's idea of the
rootfs for example). It does this by passing the MNT_EXPIRE flag to
umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
If the mountpoint in question is referenced by something other than
umount() or its parent mountpoint, an EBUSY error will be returned and the
mountpoint will not be marked for expiration or unmounted.
If the mountpoint was not already marked for expiry at that time, an EAGAIN
error will be given and it won't be unmounted.
Otherwise if it was already marked and it wasn't referenced, unmounting will
take place as usual.
Again, the expiration flag is cleared every time anything other than umount()
looks at a mountpoint.


@@ -0,0 +1,117 @@
BeOS filesystem for Linux
Document last updated: Dec 6, 2001
WARNING
=======
Make sure you understand that this is alpha software. This means that the
implementation is neither complete nor well-tested.
I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
LICENSE
=====
This software is covered by the GNU General Public License.
See the file COPYING for the complete text of the license.
Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
AUTHOR
=====
The largest part of the code was written by Will Dyson <will_dyson@pobox.com>
He has been working on the code since Aug 13, 2001. See the changelog for
details.
Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
His original code can still be found at:
<http://hp.vector.co.jp/authors/VA008030/bfs/>
Does anyone know of a more current email address for Makoto? He doesn't
respond to the address given above...
Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
WHAT IS THIS DRIVER?
==================
This module implements the native filesystem of BeOS <http://www.be.com/>
for the linux 2.4.1 and later kernels. Currently it is a read-only
implementation.
Which is it, BFS or BEFS?
================
Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
But the UnixWare Boot Filesystem is called bfs, too, and it is already in
the kernel. Because of this naming conflict, on Linux the BeOS
filesystem is called befs.
HOW TO INSTALL
==============
step 1. Install the BeFS patch into the source code tree of linux.
Apply the patchfile to your kernel source tree.
Assuming that your kernel source is in /foo/bar/linux and the patchfile
is called patch-befs-xxx, you would do the following:
cd /foo/bar/linux
patch -p1 < /path/to/patch-befs-xxx
if the patching step fails (i.e. there are rejected hunks), you can try to
figure it out yourself (it shouldn't be hard), or mail the maintainer
(Will Dyson <will_dyson@pobox.com>) for help.
step 2. Configuration & make kernel
The linux kernel has many compile-time options. Most of them are beyond the
scope of this document. I suggest the Kernel-HOWTO document as a good general
reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html>
However, to use the BeFS module, you must enable it at configure time.
cd /foo/bar/linux
make menuconfig (or xconfig)
The BeFS module is not a standard part of the linux kernel, so you must first
enable support for experimental code under the "Code maturity level" menu.
Then, under the "Filesystems" menu will be an option called "BeFS
filesystem (experimental)", or something like that. Enable that option
(it is fine to make it a module).
Save your kernel configuration and then build your kernel.
step 3. Install
See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
instructions on this critical step.
USING BEFS
==========
To use the BeOS filesystem, use filesystem type 'befs'.
ex)
mount -t befs /dev/fd0 /beos
MOUNT OPTIONS
=============
uid=nnn All files in the partition will be owned by user id nnn.
gid=nnn All files in the partition will be in group nnn.
iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog.
HOW TO GET THE LATEST VERSION
=============================
The latest version is currently available at:
<http://befs-driver.sourceforge.net/>
ANY KNOWN BUGS?
===========
As of Jan 20, 2002:
None
SPECIAL THANKS
==============
Dominic Giampaolo ... Writing "Practical File System Design with the Be File System"
Hiroyuki Yamada ... Testing LinuxPPC.


@@ -0,0 +1,57 @@
BFS FILESYSTEM FOR LINUX
========================
The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
usually contains the kernel image and a few other files required for the
boot process.
In order to access /stand partition under Linux you obviously need to
know the partition number and the kernel must support UnixWare disk slices
(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
depend on having UnixWare disklabel support because one can also mount
BFS filesystem via loopback:
# losetup /dev/loop0 stand.img
# mount -t bfs /dev/loop0 /mnt/stand
where stand.img is a file containing the image of BFS filesystem.
When you have finished using it and umounted you need to also deallocate
/dev/loop0 device by:
# losetup -d /dev/loop0
You can simplify mounting by just typing:
# mount -t bfs -o loop stand.img /mnt/stand
this will allocate the first available loopback device (and load loop.o
kernel module if necessary) automatically. If the loopback driver is not
loaded automatically, make sure that your kernel is compiled with kmod
support (CONFIG_KMOD) enabled. Beware that umount will not
deallocate /dev/loopN device if /etc/mtab file on your system is a
symbolic link to /proc/mounts. You will need to do it manually using
"-d" switch of losetup(8). Read losetup(8) manpage for more info.
To create the BFS image under UnixWare you need to find out first which
slice contains it. The command prtvtoc(1M) is your friend:
# prtvtoc /dev/rdsk/c0b0t0d0s0
(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
look for the slice with tag "STAND", which is usually slice 10. With this
information you can use dd(1) to create the BFS image:
# umount /stand
# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
Just in case, you can verify that you have done the right thing by checking
the magic number:
# od -Ad -tx4 stand.img | more
The first 4 bytes should be 0x1badface.
If you have any patches, questions or suggestions regarding this BFS
implementation please contact the author:
Tigran A. Aivazian <tigran@veritas.com>


@@ -0,0 +1,51 @@
This is the client VFS module for the Common Internet File System
(CIFS) protocol which is the successor to the Server Message Block
(SMB) protocol, the native file sharing mechanism for most early
PC operating systems. CIFS is fully supported by current network
file servers such as Windows 2000, Windows 2003 (including
Windows XP) as well as by Samba (which provides excellent CIFS
server support for Linux and many other operating systems), so
this network filesystem client can mount to a wide variety of
servers. The smbfs module should be used instead of this cifs module
for mounting to older SMB servers such as OS/2. The smbfs and cifs
modules can coexist and do not conflict. The CIFS VFS filesystem
module is designed to work well with servers that implement the
newer versions (dialects) of the SMB/CIFS protocol such as Samba,
the program written by Andrew Tridgell that turns any Unix host
into a SMB/CIFS file server.
The intent of this module is to provide the most advanced network
file system function for CIFS compliant servers, including better
POSIX compliance, secure per-user session establishment, high
performance safe distributed caching (oplock), optional packet
signing, large files, Unicode support and other internationalization
improvements. Since both Samba server and this filesystem client support
the CIFS Unix extensions, the combination can provide a reasonable
alternative to NFSv4 for fileserving in some Linux to Linux environments,
not just in Linux to Windows environments.
This filesystem has an optional mount utility (mount.cifs) that can
be obtained from the project page and installed in the same
directory as the other mount helpers (such as mount.smbfs).
Mounting using the cifs filesystem without installing the mount helper
requires specifying the server's IP address.
For Linux 2.4:
  mount //anything/here /mnt_target -o
        user=username,pass=password,unc=//ip_address_of_server/sharename
For Linux 2.5:
  mount //ip_address_of_server/sharename /mnt_target -o user=username,pass=password
For more information on the module see the project page at
http://us1.samba.org/samba/Linux_CIFS_client.html
For more information on CIFS see:
http://www.snia.org/tech_activities/CIFS
or the Samba site:
http://www.samba.org


@@ -0,0 +1,76 @@
Cramfs - cram a filesystem onto a small ROM
cramfs is designed to be simple and small, and to compress things well.
It uses the zlib routines to compress a file one page at a time, and
allows random page access. The meta-data is not compressed, but is
expressed in a very terse representation to make it use much less
diskspace than traditional filesystems.
You can't write to a cramfs filesystem (making it compressible and
compact also makes it _very_ hard to update on-the-fly), so you have to
create the disk image with the "mkcramfs" utility.
Usage Notes
-----------
File sizes are limited to less than 16MB.
Maximum filesystem size is a little over 256MB. (The last file on the
filesystem is allowed to extend past 256MB.)
Only the low 8 bits of gid are stored. The current version of
mkcramfs simply truncates to 8 bits, which is a potential security
issue.
Hard links are supported, but hard linked files
will still have a link count of 1 in the cramfs image.
Cramfs directories have no `.' or `..' entries. Directories (like
every other file on cramfs) always have a link count of 1. (There's
no need to use -noleaf in `find', btw.)
No timestamps are stored in a cramfs, so these default to the epoch
(1970 GMT). Recently-accessed files may have updated timestamps, but
the update lasts only as long as the inode is cached in memory, after
which the timestamp reverts to 1970, i.e. moves backwards in time.
Currently, cramfs must be written and read with architectures of the
same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
== 4096. At least the latter of these is a bug, but it hasn't been
decided what the best fix is. For the moment if you have larger pages
you can just change the #define in mkcramfs.c, so long as you don't
mind the filesystem becoming unreadable to future kernels.
For /usr/share/magic
--------------------
0 ulelong 0x28cd3d45 Linux cramfs offset 0
>4 ulelong x size %d
>8 ulelong x flags 0x%x
>12 ulelong x future 0x%x
>16 string >\0 signature "%.16s"
>32 ulelong x fsid.crc 0x%x
>36 ulelong x fsid.edition %d
>40 ulelong x fsid.blocks %d
>44 ulelong x fsid.files %d
>48 string >\0 name "%.16s"
512 ulelong 0x28cd3d45 Linux cramfs offset 512
>516 ulelong x size %d
>520 ulelong x flags 0x%x
>524 ulelong x future 0x%x
>528 string >\0 signature "%.16s"
>544 ulelong x fsid.crc 0x%x
>548 ulelong x fsid.edition %d
>552 ulelong x fsid.blocks %d
>556 ulelong x fsid.files %d
>560 string >\0 name "%.16s"
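The same check can be done from a small program; the sketch below simply
mirrors the magic entries above (the image name fs.cramfs is only an example):

/* Sketch: look for the little-endian cramfs magic 0x28cd3d45 at offset 0
 * and at offset 512, as in the magic(5) entries above.
 */
#include <stdio.h>

static unsigned long rd_ulelong(FILE *f, long off)
{
	unsigned char b[4];

	if (fseek(f, off, SEEK_SET) != 0 || fread(b, 1, 4, f) != 4)
		return 0;
	return b[0] | (b[1] << 8) | ((unsigned long)b[2] << 16) |
	       ((unsigned long)b[3] << 24);
}

int main(void)
{
	FILE *f = fopen("fs.cramfs", "rb");
	long off;

	if (!f) {
		perror("fs.cramfs");
		return 1;
	}
	for (off = 0; off <= 512; off += 512)
		if (rd_ulelong(f, off) == 0x28cd3d45UL)
			printf("cramfs magic found at offset %ld\n", off);
	fclose(f);
	return 0;
}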
Hacker Notes
------------
See fs/cramfs/README for filesystem layout and implementation notes.


@@ -0,0 +1,40 @@
Device File System (devfs) ToDo List
Richard Gooch <rgooch@atnf.csiro.au>
3-JUL-2000
This is a list of things to be done for better devfs support in the
Linux kernel. If you'd like to contribute to the devfs, please have a
look at this list for anything that is unallocated. Also, if there are
items missing (surely), please contact me so I can add them to the
list (preferably with your name attached to them:-).
- >256 ptys
Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
- Amiga floppy driver (drivers/block/amiflop.c)
- Atari floppy driver (drivers/block/ataflop.c)
- SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c)
- Amiga ZorroII ramdisc driver (drivers/block/z2ram.c)
- Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c)
- Parallel port ATAPI floppy (drivers/block/paride/pf.c)
- AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c)
- Archimedes floppy (drivers/acorn/block/fd1772.c)
- MFM hard drive (drivers/acorn/block/mfmhd.c)
- I2O block device (drivers/message/i2o/i2o_block.c)
- ST-RAM device (arch/m68k/atari/stram.c)
- Raw devices


@@ -0,0 +1,65 @@
/* -*- auto-fill -*- */
Device File System (devfs) Boot Options
Richard Gooch <rgooch@atnf.csiro.au>
18-AUG-2001
When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options
to the kernel to debug devfs. The boot options are prefixed by
"devfs=", and are separated by commas. Spaces are not allowed. The
syntax looks like this:
devfs=<option1>,<option2>,<option3>
and so on. For example, if you wanted to turn on debugging for module
load requests and device registration, you would do:
devfs=dmod,dreg
You may prefix "no" to any option. This will invert the option.
Debugging Options
=================
These require CONFIG_DEVFS_DEBUG to be enabled.
Note that all debugging options have 'd' as the first character. By
default all options are off. All debugging output is sent to the
kernel logs. The debugging options do not take effect until the devfs
version message appears (just prior to the root filesystem being
mounted).
These are the options:
dmod print module load requests to <request_module>
dreg print device register requests to <devfs_register>
dunreg print device unregister requests to <devfs_unregister>
dchange print device change requests to <devfs_set_flags>
dilookup print inode lookup requests
diget print VFS inode allocations
diunlink print inode unlinks
dichange print inode changes
dimknod print calls to mknod(2)
dall some debugging turned on
Other Options
=============
These control the default behaviour of devfs. The options are:
mount mount devfs onto /dev at boot time
only disable non-devfs device nodes for devfs-capable drivers


@@ -0,0 +1,113 @@
The locking scheme used for directory operations is based on two
kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
For our purposes all operations fall into 6 classes:
1) read access. Locking rules: caller locks directory we are accessing.
2) object creation. Locking rules: same as above.
3) object removal. Locking rules: caller locks parent, finds victim,
locks victim and calls the method.
4) rename() that is _not_ cross-directory. Locking rules: caller locks
the parent, finds source and target, if target already exists - locks it
and then calls the method.
5) link creation. Locking rules:
* lock parent
* check that source is not a directory
* lock source
* call the method.
6) cross-directory rename. The trickiest in the whole bunch. Locking
rules (sketched in code after this list):
* lock the filesystem
* lock parents in "ancestors first" order.
* find source and target.
* if old parent is equal to or is a descendent of target
fail with -ENOTEMPTY
* if new parent is equal to or is a descendent of source
fail with -ELOOP
* if target exists - lock it.
* call the method.
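Below is a minimal, self-contained sketch of the ordering described in
rule 6. The types and helpers (struct dir, struct fs, is_ancestor,
cross_rename) are made up for illustration and are not the kernel's; it
assumes the two parents are distinct and that source and target have
already been found.

/* Illustration only: lock order for a cross-directory rename as described
 * above - filesystem lock, then both parents in "ancestors first" order,
 * then the target (if any).
 */
#include <pthread.h>
#include <stdio.h>

struct dir {
	struct dir *parent;
	pthread_mutex_t sem;		/* stands in for ->i_sem            */
};

struct fs {
	pthread_mutex_t rename_sem;	/* stands in for ->s_vfs_rename_sem */
};

/* Is 'a' an ancestor of (or equal to) 'b'?  Assumes the graph is a tree. */
static int is_ancestor(struct dir *a, struct dir *b)
{
	for (; b; b = b->parent)
		if (b == a)
			return 1;
	return 0;
}

static int cross_rename(struct fs *fs, struct dir *old_parent,
			struct dir *new_parent, struct dir *source,
			struct dir *target)
{
	struct dir *first = old_parent, *second = new_parent;
	int err = 0;

	pthread_mutex_lock(&fs->rename_sem);		/* lock the filesystem */
	if (is_ancestor(new_parent, old_parent)) {	/* "ancestors first"   */
		first = new_parent;
		second = old_parent;
	}
	pthread_mutex_lock(&first->sem);
	pthread_mutex_lock(&second->sem);

	if (is_ancestor(target, old_parent))
		err = -1;				/* -ENOTEMPTY above    */
	else if (is_ancestor(source, new_parent))
		err = -2;				/* -ELOOP above        */
	else {
		if (target)
			pthread_mutex_lock(&target->sem);
		/* ... call the method here ... */
		if (target)
			pthread_mutex_unlock(&target->sem);
	}

	pthread_mutex_unlock(&second->sem);
	pthread_mutex_unlock(&first->sem);
	pthread_mutex_unlock(&fs->rename_sem);
	return err;
}

int main(void)
{
	struct fs fs = { PTHREAD_MUTEX_INITIALIZER };
	struct dir root = { NULL, PTHREAD_MUTEX_INITIALIZER };
	struct dir a = { &root, PTHREAD_MUTEX_INITIALIZER };
	struct dir b = { &root, PTHREAD_MUTEX_INITIALIZER };
	struct dir victim = { &a, PTHREAD_MUTEX_INITIALIZER };

	/* "mv a/victim b/victim" with no pre-existing target. */
	printf("rename result: %d\n", cross_rename(&fs, &a, &b, &victim, NULL));
	return 0;
}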
The rules above obviously guarantee that all directories that are going to be
read, modified or removed by method will be locked by caller.
If no directory is its own ancestor, the scheme above is deadlock-free.
Proof:
First of all, at any moment we have a partial ordering of the
objects - A < B iff A is an ancestor of B.
That ordering can change. However, the following is true:
(1) if object removal or non-cross-directory rename holds lock on A and
attempts to acquire lock on B, A will remain the parent of B until we
acquire the lock on B. (Proof: only cross-directory rename can change
the parent of object and it would have to lock the parent).
(2) if cross-directory rename holds the lock on filesystem, order will not
change until rename acquires all locks. (Proof: other cross-directory
renames will be blocked on filesystem lock and we don't start changing
the order until we had acquired all locks).
(3) any operation holds at most one lock on non-directory object and
that lock is acquired after all other locks. (Proof: see descriptions
of operations).
Now consider the minimal deadlock. Each process is blocked on
attempt to acquire some lock and already holds at least one lock. Let's
consider the set of contended locks. First of all, filesystem lock is
not contended, since any process blocked on it is not holding any locks.
Thus all processes are blocked on ->i_sem.
Non-directory objects are not contended due to (3). Thus link
creation can't be a part of deadlock - it can't be blocked on source
and it means that it doesn't hold any locks.
Any contended object is either held by cross-directory rename or
has a child that is also contended. Indeed, suppose that it is held by
operation other than cross-directory rename. Then the lock this operation
is blocked on belongs to child of that object due to (1).
It means that one of the operations is cross-directory rename.
Otherwise the set of contended objects would be infinite - each of them
would have a contended child and we had assumed that no object is its
own descendent. Moreover, there is exactly one cross-directory rename
(see above).
Consider the object blocking the cross-directory rename. One
of its descendents is locked by cross-directory rename (otherwise we
would again have an infinite set of contended objects). But that
means that cross-directory rename is taking locks out of order. Due
to (2) the order hadn't changed since we had acquired filesystem lock.
But locking rules for cross-directory rename guarantee that we do not
try to acquire lock on descendent before the lock on ancestor.
Contradiction. I.e. deadlock is impossible. Q.E.D.
These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename.
Since the only new (parent, child) pair added by rename() is (new parent,
source), such loop would have to contain these objects and the rest of it
would have to exist before rename(). I.e. at the moment of loop creation
rename() responsible for that would be holding filesystem lock and new parent
would have to be equal to or a descendent of source. But that means that
new parent had been equal to or a descendent of source since the moment when
we had acquired filesystem lock and rename() would fail with -ELOOP in that
case.
While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current
implementation assumes that directory graph is a tree. This assumption is
also preserved by all operations (cross-directory rename on a tree that would
not introduce a cycle will leave it a tree and link() fails for directories).
Notice that "directory" in the above == "anything that might have
children", so if we are going to introduce hybrid objects we will need
either to make sure that link(2) doesn't work for them or to make changes
in is_subdir() that would make it work even in presence of such beasts.


@@ -0,0 +1,383 @@
The Second Extended Filesystem
==============================
ext2 was originally released in January 1993. Written by Rémy Card,
Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
Extended Filesystem. It is currently still (April 2001) the predominant
filesystem in use by Linux. There are also implementations available
for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
Options
=======
Most defaults are determined by the filesystem superblock, and can be
set using tune2fs(8). Kernel-determined defaults are indicated by (*).
bsddf (*) Makes `df' act like BSD.
minixdf Makes `df' act like Minix.
check Check block and inode bitmaps at mount time
(requires CONFIG_EXT2_CHECK).
check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
debug Extra debugging information is sent to the
kernel syslog. Useful for developers.
errors=continue Keep going on a filesystem error.
errors=remount-ro Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
grpid, bsdgroups Give objects the same group ID as their parent.
nogrpid, sysvgroups New objects have the group ID of their creator.
nouid32 Use 16-bit UIDs and GIDs.
oldalloc Enable the old block allocator. Orlov should
have better performance, we'd like to get some
feedback if it's the contrary for you.
orlov (*) Use the Orlov block allocator.
(See http://lwn.net/Articles/14633/ and
http://lwn.net/Articles/14446/.)
resuid=n The user ID which may use the reserved blocks.
resgid=n The group ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
user_xattr Enable "user." POSIX Extended Attributes
(requires CONFIG_EXT2_FS_XATTR).
See also http://acl.bestbits.at
nouser_xattr Don't support "user." extended attributes.
acl Enable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
See also http://acl.bestbits.at
noacl Don't support POSIX ACLs.
nobh Do not attach buffer_heads to file pagecache.
grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
Specification
=============
ext2 shares many properties with traditional Unix filesystems. It has
the concepts of blocks, inodes and directories. It has space in the
specification for Access Control Lists (ACLs), fragments, undeletion and
compression though these are not yet implemented (some are available as
separate patches). There is also a versioning mechanism to allow new
features (such as journalling) to be added in a maximally compatible
manner.
Blocks
------
The space in the device or file is split up into blocks. These are
a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
which is decided when the filesystem is created. Smaller blocks mean
less wasted space per file, but require slightly more accounting overhead,
and also impose other limits on the size of files and the filesystem.
Block Groups
------------
Blocks are clustered into block groups in order to reduce fragmentation
and minimise the amount of head seeking when reading a large amount
of consecutive data. Information about each block group is kept in a
descriptor table stored in the block(s) immediately after the superblock.
Two blocks near the start of each group are reserved for the block usage
bitmap and the inode usage bitmap which show which blocks and inodes
are in use. Since each bitmap is limited to a single block, this means
that the maximum size of a block group is 8 times the size of a block.
The block(s) following the bitmaps in each block group are designated
as the inode table for that block group and the remainder are the data
blocks. The block allocation algorithm attempts to allocate data blocks
in the same block group as the inode which contains them.
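Since each bitmap must fit into a single block, the group geometry follows
directly from the block size; a quick back-of-the-envelope sketch:

/* Sketch: one block-sized block bitmap per group means a group can cover
 * at most 8 * block_size blocks.
 */
#include <stdio.h>

int main(void)
{
	unsigned long bs;

	for (bs = 1024; bs <= 4096; bs *= 2) {
		unsigned long blocks = 8 * bs;	/* bits in one bitmap block */

		printf("%lukB blocks: %lu blocks per group (%lu MB per group)\n",
		       bs / 1024, blocks, blocks * bs >> 20);
	}
	return 0;
}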
The Superblock
--------------
The superblock contains all the information about the configuration of
the filing system. The primary copy of the superblock is stored at an
offset of 1024 bytes from the start of the device, and it is essential
to mounting the filesystem. Since it is so important, backup copies of
the superblock are stored in block groups throughout the filesystem.
The first version of ext2 (revision 0) stores a copy at the start of
every block group, along with backups of the group descriptor block(s).
Because this can consume a considerable amount of space for large
filesystems, later revisions can optionally reduce the number of backup
copies by only putting backups in specific groups (this is the sparse
superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
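A small sketch (not e2fsprogs code) of which of the first groups that rule
selects:

/* Sketch: with the sparse superblock feature, backups live in groups 0, 1
 * and powers of 3, 5 and 7.  List them for the first 100 groups.
 */
#include <stdio.h>

static int is_power_of(unsigned long n, unsigned long base)
{
	if (n == 0)
		return 0;
	while (n % base == 0)
		n /= base;
	return n == 1;
}

int main(void)
{
	unsigned long g;

	for (g = 0; g < 100; g++)
		if (g == 0 || g == 1 || is_power_of(g, 3) ||
		    is_power_of(g, 5) || is_power_of(g, 7))
			printf("group %lu holds a superblock backup\n", g);
	return 0;
}

(For the first 100 groups this prints 0, 1, 3, 5, 7, 9, 25, 27, 49 and 81.)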
The information in the superblock contains fields such as the total
number of inodes and blocks in the filesystem and how many are free,
how many inodes and blocks are in each block group, when the filesystem
was mounted (and if it was cleanly unmounted), when it was modified,
what version of the filesystem it is (see the Revisions section below)
and which OS created it.
If the filesystem is revision 1 or higher, then there are extra fields,
such as a volume name, a unique identification number, the inode size,
and space for optional filesystem features to store configuration info.
All fields in the superblock (as in all other ext2 structures) are stored
on the disc in little endian format, so a filesystem is portable between
machines without having to know what machine it was created on.
Inodes
------
The inode (index node) is a fundamental concept in the ext2 filesystem.
Each object in the filesystem is represented by an inode. The inode
structure contains pointers to the filesystem blocks which contain the
data held in the object and all of the metadata about an object except
its name. The metadata about an object includes the permissions, owner,
group, flags, size, number of blocks used, access time, change time,
modification time, deletion time, number of links, fragments, version
(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
There are some reserved fields which are currently unused in the inode
structure and several which are overloaded. One field is reserved for the
directory ACL if the inode is a directory and alternately for the top 32
bits of the file size if the inode is a regular file (allowing file sizes
larger than 2GB). The translator field is unused under Linux, but is used
by the HURD to reference the inode of a program which will be used to
interpret this object. Most of the remaining reserved fields have been
used up for both Linux and the HURD for larger owner and group fields.
The HURD also has a larger mode field, so it uses another of the remaining
fields to store the extra mode bits.
There are pointers to the first 12 blocks which contain the file's data
in the inode. There is a pointer to an indirect block (which contains
pointers to the next set of blocks), a pointer to a doubly-indirect
block (which contains pointers to indirect blocks) and a pointer to a
trebly-indirect block (which contains pointers to doubly-indirect blocks).
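A rough sketch of how far this pointer scheme reaches, assuming 4-byte block
pointers. (For 4kB and 8kB blocks the table under Limitations below shows a
lower 2048GB limit; other fields in the on-disk format, such as the 32-bit
i_blocks sector count, cap file sizes before the pointer scheme does.)

/* Sketch: number of data blocks addressable through 12 direct pointers
 * plus one single, one double and one triple indirect block, assuming
 * 4-byte block pointers.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long bs;

	for (bs = 1024; bs <= 8192; bs *= 2) {
		unsigned long long p = bs / 4;	/* pointers per block */
		unsigned long long blocks = 12 + p + p * p + p * p * p;

		printf("%llukB blocks: %llu addressable blocks (~%llu GB)\n",
		       bs / 1024, blocks, blocks * bs >> 30);
	}
	return 0;
}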
The flags field contains some ext2-specific flags which aren't catered
for by the standard chmod flags. These flags can be listed with lsattr
and changed with the chattr command, and allow specific filesystem
behaviour on a per-file basis. There are flags for secure deletion,
undeletable, compression, synchronous updates, immutability, append-only,
dumpable, no-atime, indexed directories, and data-journaling. Not all
of these are supported yet.
Directories
-----------
A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate
each name with an inode number. Later revisions of the filesystem also
encode the type of the object (file, directory, symlink, device, fifo,
socket) to avoid the need to check the inode itself for this information
(support for taking advantage of this feature does not yet exist in
Glibc 2.2).
The inode allocation code tries to assign inodes which are in the same
block group as the directory in which they are first created.
The current implementation of ext2 uses a singly-linked list to store
the filenames in the directory; a pending enhancement uses hashing of the
filenames to allow lookup without the need to scan the entire directory.
The current implementation never removes empty directory blocks once they
have been allocated to hold more files.
Special files
-------------
Symbolic links are also filesystem objects with inodes. They deserve
special mention because the data for them is stored within the inode
itself if the symlink is less than 60 bytes long. It uses the fields
which would normally be used to store the pointers to data blocks.
This is a worthwhile optimisation as we avoid allocating a full
block for the symlink, and most symlinks are less than 60 characters long.
Character and block special devices never have data blocks assigned to
them. Instead, their device number is stored in the inode, again reusing
the fields which would be used to point to the data blocks.
Reserved Space
--------------
In ext2, there is a mechanism for reserving a certain number of blocks
for a particular user (normally the super-user). This is intended to
allow the system to continue functioning even if non-privileged users
fill up all the space available to them (this is independent of filesystem
quotas). It also keeps the filesystem from filling up entirely which
helps combat fragmentation.
Filesystem check
----------------
At boot time, most systems run a consistency check (e2fsck) on their
filesystems. The superblock of the ext2 filesystem contains several
fields which indicate whether fsck should actually run (since checking
the filesystem at boot can take a long time if it is large). fsck will
run if the filesystem was not cleanly unmounted, if the maximum mount
count has been exceeded or if the maximum time between checks has been
exceeded.
Feature Compatibility
---------------------
The compatibility feature mechanism used in ext2 is sophisticated.
It safely allows features to be added to the filesystem, without
unnecessarily sacrificing compatibility with older versions of the
filesystem code. The feature compatibility mechanism is not supported by
the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
revision 1. There are three 32-bit fields, one for compatible features
(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
incompatible (INCOMPAT) features.
These feature flags have specific meanings for the kernel as follows:
A COMPAT flag indicates that a feature is present in the filesystem,
but the on-disk format is 100% compatible with older on-disk formats, so
a kernel which didn't know anything about this feature could read/write
the filesystem without any chance of corrupting the filesystem (or even
making it inconsistent). This is essentially just a flag which says
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
want to be aware of (more on e2fsck and feature flags later). The ext3
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
a regular file with data blocks in it so the kernel does not need to
take any special notice of it if it doesn't understand ext3 journaling.
An RO_COMPAT flag indicates that the on-disk format is 100% compatible
with older on-disk formats for reading (i.e. the feature does not change
the visible on-disk format). However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is prevented. The
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
sparse groups allow file data blocks where superblock/group descriptor
backups used to live, and ext2_free_blocks() refuses to free these blocks,
which would lead to inconsistent bitmaps. An old kernel would also
get an error if it tried to free a series of blocks which crossed a group
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
An INCOMPAT flag indicates the on-disk format has changed in some
way that makes it unreadable by older kernels, or would otherwise
cause a problem if an old kernel tried to mount it. FILETYPE is an
INCOMPAT flag because older kernels would think a filename was longer
than 256 characters, which would lead to corrupt directory listings.
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
doesn't understand compression, you would just get garbage back from
read() instead of it automatically decompressing your data. The ext3
RECOVER flag is needed to prevent a kernel which does not understand the
ext3 journal from mounting the filesystem without replaying the journal.
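A compact sketch of the mount-time decision these three flag sets imply; the
structure and the *_SUPP masks below are hypothetical stand-ins, not the
actual ext2 code:

/* Illustration of the COMPAT/RO_COMPAT/INCOMPAT semantics described above.
 * The superblock structure and the supported-feature masks are made-up
 * examples, not the kernel's definitions.
 */
#include <stdio.h>

struct sb_features {
	unsigned int compat;
	unsigned int ro_compat;
	unsigned int incompat;
};

#define KNOWN_INCOMPAT_SUPP	0x0002u	/* e.g. a filetype-style feature     */
#define KNOWN_RO_COMPAT_SUPP	0x0001u	/* e.g. a sparse-super-style feature */

/* Returns 0 = mount read-write, 1 = mount read-only only, -1 = refuse. */
static int mount_decision(const struct sb_features *f, int want_rw)
{
	if (f->incompat & ~KNOWN_INCOMPAT_SUPP)
		return -1;	/* unknown INCOMPAT feature: cannot mount      */
	if (want_rw && (f->ro_compat & ~KNOWN_RO_COMPAT_SUPP))
		return 1;	/* unknown RO_COMPAT feature: read-only access */
	return 0;		/* unknown COMPAT features are simply ignored  */
}

int main(void)
{
	struct sb_features f = { 0x4, 0x8, 0x0 };	/* example flag values */

	printf("decision: %d\n", mount_decision(&f, 1));
	return 0;
}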
For e2fsck, it needs to be more strict with the handling of these
flags than the kernel. If it doesn't understand ANY of the COMPAT,
RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
because it has no way of verifying whether a given feature is valid
or not. Allowing e2fsck to succeed on a filesystem with an unknown
feature is a false sense of security for the user. Refusing to check
a filesystem with unknown features is a good incentive for the user to
update to the latest e2fsck. This also means that anyone adding feature
flags to ext2 also needs to update e2fsck to verify these features.
Metadata
--------
It is frequently claimed that the ext2 implementation of writing
asynchronous metadata is faster than the ffs synchronous metadata
scheme but less reliable. Both methods are equally resolvable by their
respective fsck programs.
If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:
per-file if you have the program source: use the O_SYNC flag to open()
per-file if you don't have the source: use "chattr +S" on the file
per-filesystem: add the "sync" option to mount (or in /etc/fstab)
the first and last are not ext2 specific but do force the metadata to
be written synchronously. See also Journaling below.
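For the per-file, have-the-source case this is just a standard open(2) flag;
a minimal sketch (the file name is only an example):

/* Sketch: open a file with O_SYNC so writes, and the metadata updates they
 * imply, are performed synchronously.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "synchronous write\n", 18) != 18)
		perror("write");
	close(fd);
	return 0;
}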
Limitations
-----------
There are various limits imposed by the on-disk layout of ext2. Other
limits are imposed by the current implementation of the kernel code.
Many of the limits are determined at the time the filesystem is first
created, and depend upon the block size chosen. The ratio of inodes to
data blocks is fixed at filesystem creation time, so the only way to
increase the number of inodes is to increase the size of the filesystem.
No tools currently exist which can change the ratio of inodes to blocks.
Most of these limits could be overcome with slight changes in the on-disk
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).
Filesystem block size: 1kB 2kB 4kB 8kB
File size limit: 16GB 256GB 2048GB 2048GB
Filesystem size limit: 2047GB 8192GB 16384GB 32768GB
There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
an upper limit on the block size imposed by the page size of the kernel,
so 8kB blocks are only allowed on Alpha systems (and other architectures
which support larger pages).
There is an upper limit of 32768 subdirectories in a single directory.
There is a "soft" upper limit of about 10-15k files in a single directory
with the current linear linked-list directory implementation. This limit
stems from performance problems when creating and deleting (and also
finding) files in such large directories. Using a hashed directory index
(under development) allows 100k-1M+ files in a single directory without
performance problems (although RAM size becomes an issue at this point).
The (meaningless) absolute upper limit of files in a single directory
(imposed by the file size, the realistic limit is obviously much less)
is over 130 trillion files. It would be higher except there are not
enough 4-character names to make up unique directory entries, so they
have to be 8-character filenames, and even then we are fairly close to
running out of unique filenames.
Journaling
----------
A journaling extension to the ext2 code has been developed by Stephen
Tweedie. It avoids the risks of metadata corruption and the need to
wait for e2fsck to complete after a crash, without requiring a change
to the on-disk ext2 layout. In a nutshell, the journal is a regular
file which stores whole metadata (and optionally data) blocks that have
been modified, prior to writing them into the filesystem. This means
it is possible to add a journal to an existing ext2 filesystem without
the need for data conversion.
When changes are made to the filesystem (e.g. a file is renamed), they are stored in
a transaction in the journal and can either be complete or incomplete at
the time of a crash. If a transaction is complete at the time of a crash
(or in the normal case where the system does not crash), then any blocks
in that transaction are guaranteed to represent a valid filesystem state,
and are copied into the filesystem. If a transaction is incomplete at
the time of the crash, then there is no guarantee of consistency for
the blocks in that transaction so they are discarded (which means any
filesystem changes they represent are also lost).
Check Documentation/filesystems/ext3.txt if you want to read more about
ext3 and journaling.
References
==========
The kernel source file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Hashed Directories http://kernelnewbies.org/~phillips/htree/
Filesystem Resizing http://ext2resize.sourceforge.net/
Compression (*) http://www.netspace.net.au/~reiter/e2compr/
Implementations for:
Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2
DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/
(*) no longer actively developed/supported (as of Apr 2001)


@@ -0,0 +1,183 @@
Ext3 Filesystem
===============
ext3 was originally released in September 1999. Written by Stephen Tweedie
for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
ext3 is the ext2 filesystem enhanced with journalling capabilities.
Options
=======
When mounting an ext3 filesystem, the following options are accepted:
(*) == default
journal=update Update the ext3 file system's journal to the
current format.
journal=inum When a journal already exists, this option is
ignored. Otherwise, it specifies the number of
the inode which will represent the ext3 file
system's journal file.
noload Don't load the journal on mounting.
data=journal All data are committed into the journal prior
to being written into the main file system.
data=ordered (*) All data are forced directly out to the main file
system prior to its metadata being committed to
the journal.
data=writeback Data ordering is not preserved, data may be
written into the main file system after its
metadata has been committed to the journal.
commit=nrsec (*) Ext3 can be told to sync all its data and metadata
every 'nrsec' seconds. The default value is 5 seconds.
This means that if you lose your power, you will lose
at most the latest 5 seconds of work (your filesystem
will not be damaged though, thanks to journaling). This
default value (or any low value) will hurt performance,
but it's good for data-safety. Setting it to 0 will
have the same effect as leaving it at the default 5 sec.
Setting it to very large values will improve
performance.
barrier=1 This enables/disables barriers. barrier=0 disables it,
barrier=1 enables it.
orlov (*) This enables the new Orlov block allocator. It's enabled
by default.
oldalloc This disables the Orlov block allocator and enables the
old block allocator. Orlov should have better performance,
we'd like to get some feedback if it's the contrary for
you.
user_xattr (*) Enables POSIX Extended Attributes. It's enabled by
default, however you need to configure its support
(CONFIG_EXT3_FS_XATTR). This is necessary if you want
to use POSIX Access Control Lists support. You can visit
http://acl.bestbits.at to know more about POSIX Extended
attributes.
nouser_xattr Disables POSIX Extended Attributes.
acl (*) Enables POSIX Access Control Lists support. This is
enabled by default, however you need to configure
its support (CONFIG_EXT3_FS_POSIX_ACL). If you want
to know more about ACLs visit http://acl.bestbits.at
noacl This option disables POSIX Access Control List support.
reservation
noreservation
resize=
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
check=none Don't do extra checking of bitmaps on mount.
nocheck
debug Extra debugging information is sent to syslog.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
grpid Give objects the same group ID as their parent.
bsdgroups
nogrpid (*) New objects have the group ID of their creator.
sysvgroups
resgid=n The group ID which may use the reserved blocks.
resuid=n The user ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
quota Quota options are currently silently ignored.
noquota (see fs/ext3/super.c, line 594)
grpquota
usrquota
Specification
=============
ext3 shares the same on-disk format as the ext2 filesystem, and adds
transaction capabilities to it. Journaling is done by the
Journaling Block Device layer.
Journaling Block Device layer
-----------------------------
The Journaling Block Device layer (JBD) isn't ext3-specific. It was
designed to add journaling capabilities to a block device. The ext3
filesystem code will inform the JBD of modifications it is performing
(called a transaction). The journal supports transaction start and
stop, and in case of a crash the journal can replay the transactions
to quickly bring the partition back to a consistent state.
Handles represent a single atomic update to a filesystem. JBD can
also manage an external journal on a block device.
Data Mode
---------
There are 3 different data modes:
* writeback mode
In data=writeback mode, ext3 does not journal data at all. This mode
provides a similar level of journaling as XFS, JFS, and ReiserFS in its
default mode - metadata journaling. A crash+recovery can cause
incorrect data to appear in files which were written shortly before the
crash. This mode will typically provide the best ext3 performance.
* ordered mode
In data=ordered mode, ext3 only officially journals metadata, but it
logically groups metadata and data blocks into a single unit called a
transaction. When it's time to write the new metadata out to disk, the
associated data blocks are written first. In general, this mode
performs slightly slower than writeback but significantly faster than
journal mode.
* journal mode
data=journal mode provides full data and metadata journaling. All new
data is written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both
data and metadata into a consistent state. This mode is the slowest
except when data needs to be read from and written to disk at the same
time, where it outperforms all other modes.
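The data mode is chosen at mount time. As a sketch using the mount(2) system
call (the device and mount point names are placeholders), this is equivalent
to 'mount -t ext3 -o data=journal /dev/hda2 /mnt/data':

/* Sketch: mount an ext3 filesystem with full data journaling. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("/dev/hda2", "/mnt/data", "ext3", 0, "data=journal") < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}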
Compatibility
-------------
Ext2 partitions can easily be converted to ext3 with `tune2fs -j <dev>`.
Ext3 is fully compatible with Ext2. Ext3 partitions can easily be
mounted as Ext2.
External Tools
==============
See the manual pages to know more.
tune2fs: create an ext3 journal on an ext2 partition with the -j flag
mke2fs: create an ext3 partition with the -j flag
debugfs: ext2 and ext3 file system debugger
References
==========
kernel source: file:/usr/src/linux/fs/ext3
file:/usr/src/linux/fs/jbd
programs: http://e2fsprogs.sourceforge.net
useful link:
http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
http://www-106.ibm.com/developerworks/linux/library/l-fs7/
http://www-106.ibm.com/developerworks/linux/library/l-fs8/


@@ -0,0 +1,83 @@
Macintosh HFS Filesystem for Linux
==================================
HFS stands for ``Hierarchical File System'' and is the filesystem used
by the Mac Plus and all later Macintosh models. Earlier Macintosh
models used MFS (``Macintosh File System''), which is not supported.
MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
HFS but is extended in various areas. Use the hfsplus filesystem driver
to access such filesystems from Linux.
Mount options
=============
When mounting an HFS filesystem, the following options are accepted:
creator=cccc, type=cccc
Specifies the creator/type values as shown by the MacOS finder
used for creating new files. Default values: '????'.
uid=n, gid=n
Specifies the user/group that owns all files on the filesystems.
Default: user/group id of the mounting process.
dir_umask=n, file_umask=n, umask=n
Specifies the umask used for all files, all directories, or all
files and directories. Defaults to the umask of the mounting process.
session=n
Select the CDROM session to mount as HFS filesystem. Defaults to
leaving that decision to the CDROM driver. This option will fail
with anything but a CDROM as the underlying device.
part=n
Select partition number n from the device. This only makes
sense for CDROMs because they can't be partitioned under Linux.
For disk devices the generic partition parsing code does this
for us. Defaults to not parsing the partition table at all.
quiet
Ignore invalid mount options instead of complaining.
Writing to HFS Filesystems
==========================
HFS is not a UNIX filesystem, thus it does not have the usual features you'd
expect:
o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
and gid of files.
o You can't create hard- or symlinks, device files, sockets or FIFOs.
HFS does, on the other hand, have the concept of multiple forks per file. These
non-standard forks are represented as hidden additional files in the normal
filesystem namespace, which is kind of a kludge and makes the semantics
a little strange:
o You can't create, delete or rename resource forks of files or the
Finder's metadata.
o They are however created (with default values), deleted and renamed
along with the corresponding data fork or directory.
o Copying files to a different filesystem will lose those attributes
that are essential for MacOS to work.
Creating HFS filesystems
========================
The hfsutils package from Robert Leslie contains a program called
hformat that can be used to create HFS filesystems. See
<http://www.mars.org/home/rob/proj/hfs/> for details.
Credits
=======
The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU)
and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis
Technologies.
Roman rewrote large parts of the code and brought in btree routines derived
from Brad Boyer's hfsplus driver (also maintained by Roman now).


@@ -0,0 +1,296 @@
Read/Write HPFS 2.09
1998-2004, Mikulas Patocka
email: mikulas@artax.karlin.mff.cuni.cz
homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
CREDITS:
Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
is taken from it
Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
Mount options
uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
Set owner/group/mode for files that do not have it specified in extended
attributes. Mode is inverted umask - for example umask 027 gives owner
all permission, group read permission and anybody else no access. Note
that for files mode is anded with 0666. If you want files to have 'x'
rights, you must use extended attributes.
case=lower,asis (default asis)
File name lowercasing in readdir.
conv=binary,text,auto (default binary)
CR/LF -> LF conversion; if auto, the decision is made according to extension
- there is a list of text extensions (I think it's better to not convert
a text file than to damage a binary file). If you want to change that list,
change it in the source. The original read-only HPFS contained some strange
heuristic algorithm that I removed. I think it's dangerous to let the
computer decide whether a file is text or binary. For example, DJGPP
binaries contain a small text message at the beginning and they could be
misidentified and damaged under some circumstances.
check=none,normal,strict (default normal)
Check level. Selecting none will cause only little speedup and big
danger. I tried to write it so that it won't crash if check=normal on
corrupted filesystems. check=strict means many superfluous checks -
used for debugging (for example it checks if file is allocated in
bitmaps when accessing it).
errors=continue,remount-ro,panic (default remount-ro)
Behaviour when filesystem errors found.
chkdsk=no,errors,always (default errors)
When to mark filesystem dirty so that OS/2 checks it.
eas=no,ro,rw (default rw)
What to do with extended attributes. 'no' - ignore them and use always
values specified in uid/gid/mode options. 'ro' - read extended
attributes but do not create them. 'rw' - create extended attributes
when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
timeshift=(-)nnn (default 0)
Shifts the time by nnn seconds. For example, if you see one hour more
under Linux than under OS/2, use timeshift=-3600.
File names
As in OS/2, filenames are case insensitive. However, the shell thinks that names
are case sensitive, so for example when you create a file FOO, you can use
'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note that you
also won't be able to compile the Linux kernel (and maybe other things) on HPFS
because the kernel creates different files with names like bootsect.S and
bootsect.s. When searching for a file whose name has characters >= 128, codepages
are used - see below.
OS/2 ignores dots and spaces at the end of file name, so this driver does as
well. If you create 'a. ...', the file 'a' will be created, but you can still
access it under names 'a.', 'a..', 'a . . . ' etc.
Extended attributes
On HPFS partitions, OS/2 can associate with each file special information called
extended attributes. Extended attributes are pairs of (key,value) where key is
an ascii string identifying that attribute and value is any string of bytes of
variable length. OS/2 stores window and icon positions and file types there. So
why not use it for unix-specific info like file owner or access rights? This
driver can do it. If you chown/chgrp/chmod on an hpfs partition, extended
attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
those extended attributes whose value differs from the defaults specified in the
mount options are created.
they're just changed. It means that when your default uid=0 and you type
something like 'chown luser file; chown root file' the file will contain
extended attribute UID=0. And when you umount the fs and mount it again with
uid=luser_uid, the file will be still owned by root! If you chmod file to 444,
extended attribute "MODE" will not be set, this special case is done by setting
read-only flag. When you mknod a block or char device, besides "MODE", the
special 4-byte extended attribute "DEV" will be created containing the device
number. Currently this driver cannot resize extended attributes - it means
that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
attributes with different sizes, they won't be rewritten and changing these
values doesn't work.
Symlinks
You can do symlinks on HPFS partition, symlinks are achieved by setting extended
attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
chgrp symlinks but I don't know what it is good for. chmoding a symlink results
in chmoding the file the symlink points to. These symlinks are just for Linux use and
incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are
stored in very crazy way. They tried to do it so that link changes when file is
moved ... sometimes it works. But the link is partly stored in directory
extended attributes and partly in OS2SYS.INI. I don't want (and don't know how)
to analyze or change OS2SYS.INI.
Codepages
HPFS can contain several uppercasing tables for several codepages and each
file has a pointer to the codepage its name is in. However OS/2 was created in
America where people don't care much about codepages and so multiple codepages
support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
Once I booted English OS/2 working in cp 850 and I created a file on my 852
partition. It marked file name codepage as 850 - good. But when I again booted
Czech OS/2, the file was completely inaccessible under any name. It seems that
OS/2 uppercases the search pattern with its system code page (852) and file
name it's comparing to with its code page (850). These could never match. Is it
really what IBM developers wanted? But problems continued. When I created in
Czech OS/2 another file in that directory, that file was inaccessible too. OS/2
probably uses different uppercasing method when searching where to place a file
(note, that files in HPFS directory must be sorted) and when searching for
a file. Finally when I opened this directory in PmShell, PmShell crashed (the
funny thing was that, when rebooted, PmShell tried to reopen this directory
again :-). chkdsk happily ignores these errors and only low-level disk
modification saved me. Never mix different language versions of OS/2 on one
system although HPFS was designed to allow that.
OK, I could implement complex codepage support to this driver but I think it
would cause more problems than benefit with such buggy implementation in OS/2.
So this driver simply uses first codepage it finds for uppercasing and
lowercasing no matter what's file codepage index. Usually all file names are in
this codepage - if you don't try to do what I described above :-)
Known bugs
HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
should work. If you have OS/2 server, use only read-only mode. I don't know how
to handle some HPFS386 structures like access control list or extended perm
list, I don't know how to delete them when file is deleted and how to not
overwrite them with extended attributes. Send me some info on these structures
and I'll make it. However, this driver should detect presence of HPFS386
structures, remount read-only and not destroy them (I hope).
When there's not enough space for extended attributes, they will be truncated
and no error is returned.
OS/2 can't access files if the path is longer than about 256 chars but this
driver allows you to do it. chkdsk ignores such errors.
Sometimes you won't be able to delete some files on a very full filesystem
(returning error ENOSPC). That's because a file in a non-leaf node of the
directory tree (one directory, if it's large, has its dirents in a tree on HPFS)
must be replaced with another node when deleted. And that new file might have a
larger name than the old one, so the new name doesn't fit in the directory node
(dnode). And that would result in directory tree splitting, which takes disk
space. The workaround is to delete other files that are leaves (the probability
that a file is non-leaf is about 1/50) or to truncate the file first to make some space.
You encounter this problem only if you have many directories so that
preallocated directory band is full i.e.
number_of_directories / size_of_filesystem_in_mb > 4.
You can't delete open directories.
You can't rename over directories (what is it good for?).
Renaming files so that only case changes doesn't work. This driver supports it
but vfs doesn't. Something like 'mv file FILE' won't work.
All atimes and directory mtimes are not updated. That's because of performance
reasons. If you extremely wish to update them, let me know, I'll write it (but
it will be slow).
When the system is out of memory and swap, it may slightly corrupt filesystem
(lost files, unbalanced directories). (I guess all filesystems may do it).
When compiled, you get warning: function declaration isn't a prototype. Does
anybody know what it means?
What does "unbalanced tree" message mean?
Old versions of this driver created sometimes unbalanced dnode trees. OS/2
chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely
crashes when the tree is not balanced. This driver handles unbalanced trees
correctly and writes warning if it finds them. If you see this message, this is
probably because of directories created with old version of this driver.
Workaround is to move all files from that directory to another and then back
again. Do it in Linux, not OS/2! If you see this message in directory that is
whole created by this driver, it is BUG - let me know about it.
Bugs in OS/2
When you have two (or more) lost directories pointing each to other, chkdsk
locks up when repairing filesystem.
Sometimes (I think it's random) when you create a file with one-char name under
OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
error corrected".
File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
marks them as short (and writes "minor fs error corrected"). This bug is not in
HPFS386.
Codepage bugs described above.
If you don't install fixpacks, there are many, many more...
History
0.90 First public release
0.91 Fixed bug that caused shooting to memory when write_inode was called on
open inode (rarely happened)
0.92 Fixed a little memory leak in freeing directory inodes
0.93 Fixed bug that locked up the machine when there were too many filenames
with first 15 characters same
Fixed write_file to zero file when writing behind file end
0.94 Fixed a little memory leak when trying to delete busy file or directory
0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
1.90 First version for 2.1.1xx kernels
1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
Fixed a race-condition when write_inode is called while deleting file
Fixed a bug that could possibly happen (with very low probability) when
using 0xff in filenames
Rewritten locking to avoid race-conditions
Mount option 'eas' now works
Fsync no longer returns error
Files beginning with '.' are marked hidden
Remount support added
Alloc is not so slow when filesystem becomes full
Atimes are no more updated because it slows down operation
Code cleanup (removed all commented debug prints)
1.92 Corrected a bug when sync was called just before closing file
1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
works with previous versions
Fixed a possible problem with disks > 64G (but I don't have one, so I can't
test it)
Fixed a file overflow at 2G
Added new option 'timeshift'
Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
read-only mode
Fixed a bug that slowed down alloc and prevented allocating 100% space
(this bug was not destructive)
1.94 Added workaround for one bug in Linux
Fixed one buffer leak
Fixed some incompatibilities with large extended attributes (but it's still
not 100% ok, I have no info on it and OS/2 doesn't want to create them)
Rewritten allocation
Fixed a bug with i_blocks (du sometimes didn't display correct values)
Directories have no longer archive attribute set (some programs don't like
it)
Fixed a bug that it set badly one flag in large anode tree (it was not
destructive)
1.95 Fixed one buffer leak, that could happen on corrupted filesystem
Fixed one bug in allocation in 1.94
1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
error sometimes when opening directories in PMSHELL)
Fixed a possible bitmap race
Fixed possible problem on large disks
You can now delete open files
Fixed a nondestructive race in rename
1.97 Support for HPFS v3 (on large partitions)
Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
1.97.1 Changed names of global symbols
Fixed a bug when chmoding or chowning root directory
1.98 Fixed a deadlock when using old_readdir
Better directory handling; workaround for "unbalanced tree" bug in OS/2
1.99 Corrected a possible problem when there's not enough space while deleting
file
Now it tries to truncate the file if there's not enough space when deleting
Removed a lot of redundant code
2.00 Fixed a bug in rename (it was there since 1.96)
Better anti-fragmentation strategy
2.01 Fixed problem with directory listing over NFS
Directory lseek now checks for proper parameters
Fixed race-condition in buffer code - it is in all filesystems in Linux;
when reading device (cat /dev/hda) while creating files on it, files
could be damaged
2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
end of partition
2.03 Char, block devices and pipes are correctly created
Fixed non-crashing race in unlink (Alexander Viro)
Now it works with Japanese version of OS/2
2.04 Fixed error when ftruncate used to extend file
2.05 Fixed crash when got mount parameters without =
Fixed crash when allocation of anode failed due to full disk
Fixed some crashes when block io or inode allocation failed
2.06 Fixed some crash on corrupted disk structures
Better allocation strategy
Reschedule points added so that it doesn't lock CPU long time
It should work in read-only mode on Warp Server
2.07 More fixes for Warp Server. Now it really works
2.08 Creating new files is not so slow on large disks
An attempt to sync deleted file does not generate filesystem error
2.09 Fixed error on extremely fragmented files
vim: set textwidth=80:


@@ -0,0 +1,38 @@
Mount options that are the same as for msdos and vfat partitions.
gid=nnn All files in the partition will be in group nnn.
uid=nnn All files in the partition will be owned by user id nnn.
umask=nnn The permission mask (see umask(1)) for the partition.
Mount options that are the same as vfat partitions. These are only useful
when using discs encoded using Microsoft's Joliet extensions.
iocharset=name Character set to use for converting from Unicode to
ASCII. Joliet filenames are stored in Unicode format, but
Unix for the most part doesn't know how to deal with Unicode.
There is also an option of doing UTF8 translations with the
utf8 option.
utf8 Encode Unicode names in UTF8 format. Default is no.
Mount options unique to the isofs filesystem.
block=512 Set the block size for the disk to 512 bytes
block=1024 Set the block size for the disk to 1024 bytes
block=2048 Set the block size for the disk to 2048 bytes
check=relaxed Matches filenames with different cases
check=strict Matches only filenames with the exact same case
cruft Try to handle badly formatted CDs.
map=off Do not map non-Rock Ridge filenames to lower case
map=normal Map non-Rock Ridge filenames to lower case
map=acorn As map=normal but also apply Acorn extensions if present
mode=xxx Sets the permissions on files to xxx
nojoliet Ignore Joliet extensions if they are present.
norock Ignore Rock Ridge extensions if they are present.
unhide Show hidden files.
session=x Select number of session on multisession CD
sbsector=xxx Session begins from sector xxx
Recommended documents about ISO 9660 standard are located at:
http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm
ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
identical with ISO 9660.", so it is a valid and gratis substitute of the
official ISO specification.


@@ -0,0 +1,35 @@
IBM's Journaled File System (JFS) for Linux
JFS Homepage: http://jfs.sourceforge.net/
The following mount options are supported:
iocharset=name Character set to use for converting from Unicode to
ASCII. The default is to do no conversion. Use
iocharset=utf8 for UTF8 translations. This requires
CONFIG_NLS_UTF8 to be set in the kernel .config file.
iocharset=none specifies the default behavior explicitly.
resize=value Resize the volume to <value> blocks. JFS only supports
growing a volume, not shrinking it. This option is only
valid during a remount, when the volume is mounted
read-write. The resize keyword with no value will grow
the volume to the full size of the partition.
nointegrity Do not write to the journal. The primary use of this option
is to allow for higher performance when restoring a volume
from backup media. The integrity of the volume is not
guaranteed if the system abnormally abends.
integrity Default. Commit metadata changes to the journal. Use this
option to remount a volume where the nointegrity option was
previously specified in order to restore normal behavior.
errors=continue Keep going on a filesystem error.
errors=remount-ro Default. Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
Please send bugs, comments, cards and letters to shaggy@austin.ibm.com.
The JFS mailing list can be subscribed to by using the link labeled
"Mail list Subscribe" at our web page http://jfs.sourceforge.net/


@@ -0,0 +1,12 @@
The ncpfs filesystem understands the NCP protocol, designed by the
Novell Corporation for their NetWare(tm) product. NCP is functionally
similar to the NFS used in the TCP/IP community.
To mount a NetWare filesystem, you need a special mount program, which
can be found in the ncpfs package. The home site for ncpfs is
ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
will have it as well.
Related products are linware and mars_nwe, which will give Linux partial
NetWare server functionality. Linware's home site is
klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on
ftp.gwdg.de/pub/linux/misc/ncpfs.


@@ -0,0 +1,630 @@
The Linux NTFS filesystem driver
================================
Table of contents
=================
- Overview
- Web site
- Features
- Supported mount options
- Known bugs and (mis-)features
- Using NTFS volume and stripe sets
- The Device-Mapper driver
- The Software RAID / MD driver
- Limitations when using the Software RAID / MD driver
- ChangeLog
Overview
========
Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
These include mkntfs, a full-featured ntfs file system format utility;
ntfsundelete, used for recovering files that were unintentionally deleted
from an NTFS volume; and ntfsresize, which is used to resize an NTFS partition.
See the web site for more information.
To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
system type 'ntfs'. The driver currently supports read-only mode (with no
fault-tolerance, encryption or journalling) and very limited, but safe, write
support.
For fault tolerance and raid support (i.e. volume and stripe sets), you can
use the kernel's Software RAID / MD driver. See section "Using NTFS volume
and stripe sets" for details.
Web site
========
There is plenty of additional information on the linux-ntfs web site
at http://linux-ntfs.sourceforge.net/
The web site has a lot of additional information, such as a comprehensive
FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
userspace utilities, etc.
Features
========
- This is a complete rewrite of the NTFS driver that used to be in the kernel.
This new driver implements NTFS read support and is functionally equivalent
to the old ntfs driver.
- The new driver has full support for sparse files on NTFS 3.x volumes which
the old driver isn't happy with.
- The new driver supports execution of binaries due to mmap() now being
supported.
- The new driver supports loopback mounting of files on NTFS. This is used by
some Linux distributions to let the user run Linux from an NTFS partition:
a large file is created under Windows, the file is loopback mounted under
Linux, and a Linux filesystem is created inside it, into which Linux is
then installed.
- A comparison of the two drivers using:
time find . -type f -exec md5sum "{}" \;
run three times in sequence with each driver (after a reboot) on a 1.4GiB
NTFS partition, showed the new driver to be 20% faster in total time elapsed
(from 9:43 minutes on average down to 7:53). The time spent in user space
was unchanged but the time spent in the kernel was decreased by a factor of
2.5 (from 85 CPU seconds down to 33).
- The driver does not support short file names in general. For backwards
compatibility, we implement access to files using their short file names if
they exist. The driver will not create short file names however, and a
rename will discard any existing short file name.
- The new driver supports exporting of mounted NTFS volumes via NFS.
- The new driver supports async io (aio).
- The new driver supports fsync(2), fdatasync(2), and msync(2).
- The new driver supports readv(2) and writev(2).
- The new driver supports access time updates (including mtime and ctime).
Supported mount options
=======================
In addition to the generic mount options described by the manual page for the
mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
following mount options:
iocharset=name Deprecated option. Still supported but please use
nls=name in the future. See description for nls=name.
nls=name Character set to use when returning file names.
Unlike VFAT, NTFS suppresses names that contain
unconvertible characters. Note that most character
sets contain insufficient characters to represent all
possible Unicode characters that can exist on NTFS.
To be sure you are not missing any files, you are
advised to use nls=utf8 which is capable of
representing all Unicode characters.
utf8=<bool> Option no longer supported. Currently mapped to
nls=utf8 but please use nls=utf8 in the future and
make sure utf8 is compiled either as module or into
the kernel. See description for nls=name.
uid=
gid=
umask= Provide default owner, group, and access mode mask.
These options work as documented in mount(8). By
default, the files/directories are owned by root and
he/she has read and write permissions, as well as
browse permission for directories. No one else has any
access permissions. I.e. the mode on all files is by
default rw------- and for directories rwx------, a
consequence of the default fmask=0177 and dmask=0077.
Using a umask of zero will grant all permissions to
everyone, i.e. all files and directories will have mode
rwxrwxrwx.
fmask=
dmask= Instead of specifying umask which applies both to
files and directories, fmask applies only to files and
dmask only to directories.
sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
Otherwise the default behaviour is to abort mount if
any unknown options are found.
show_sys_files=<BOOL> If show_sys_files is specified, show the system files
in directory listings. Otherwise the default behaviour
is to hide the system files.
Note that even when show_sys_files is specified, "$MFT"
will not be visible due to bugs/mis-features in glibc.
Further, note that irrespective of show_sys_files, all
files are accessible by name, i.e. you can always do
"ls -l \$UpCase" for example to specifically show the
system file containing the Unicode upcase table.
case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
case sensitive and create file names in the POSIX
namespace. Otherwise the default behaviour is to treat
file names as case insensitive and to create file names
in the WIN32/LONG name space. Note, the Linux NTFS
driver will never create short file names and will
remove them on rename/delete of the corresponding long
file name.
Note that files remain accessible via their short file
name, if it exists. If case_sensitive, you will need
to provide the correct case of the short file name.
errors=opt What to do when critical file system errors are found.
Following values can be used for "opt":
continue: DEFAULT, try to clean-up as much as
possible, e.g. marking a corrupt inode as
bad so it is no longer accessed, and then
continue.
recover: At present only supported is recovery of
the boot sector from the backup copy.
If read-only mount, the recovery is done
in memory only and not written to disk.
Note that the options are additive, i.e. specifying:
errors=continue,errors=recover
means the driver will attempt to recover and if that
fails it will clean-up as much as possible and
continue.
mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
setting is not persistent across mounts and can be
changed from mount to mount but cannot be changed on
remount). Values of 1 to 4 are allowed, 1 being the
default. The MFT zone multiplier determines how much
space is reserved for the MFT on the volume. If all
other space is used up, then the MFT zone will be
shrunk dynamically, so this has no impact on the
amount of free space. However, it can have an impact
on performance by affecting fragmentation of the MFT.
In general use the default. If you have a lot of small
files then use a higher value. The values have the
following meaning:
Value MFT zone size (% of volume size)
1 12.5%
2 25%
3 37.5%
4 50%
Note this option is irrelevant for read-only mounts.
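Putting some of the above together, a read-only mount that makes the volume
readable (and its directories browsable) by everyone, owned by an ordinary
user and with UTF-8 name translation, might look like this (the device, mount
point, uid and gid below are placeholders only):
	mount -t ntfs -o ro,nls=utf8,uid=500,gid=100,fmask=0333,dmask=0222 \
		/dev/hda2 /mnt/windows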
Known bugs and (mis-)features
=============================
- The link count on each directory inode entry is set to 1, due to Linux not
supporting directory hard links. This may well confuse some user space
applications, since the directory names will have the same inode numbers.
This also speeds up ntfs_read_inode() immensely. And we haven't found any
problems with this approach so far. If you find a problem with this, please
let us know.
Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
Using NTFS volume and stripe sets
=================================
For support of volume and stripe sets, you can either use the kernel's
Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
the recommended one to use for linear raid. But the latter is required for
raid level 5. For striping and mirroring, either driver should work fine.
The Device-Mapper driver
------------------------
You will need to create a table of the components of the volume/stripe set and
how they fit together and load this into the kernel using the dmsetup utility
(see man 8 dmsetup).
Linear volume sets, i.e. linear raid, have been tested and work fine. Even
though untested, there is no reason why stripe sets, i.e. raid level 0, and
mirrors, i.e. raid level 1, should not work, too. Stripes with parity, i.e.
raid level 5, unfortunately cannot work yet because the current version of the
Device-Mapper driver does not support raid level 5. You may be able to use the
Software RAID / MD driver for raid level 5, see the next section for details.
To create the table describing your volume you will need to know each of its
components and their sizes in sectors, i.e. multiples of 512-byte blocks.
For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
example if one of your partitions is /dev/hda2 you would do:
$ fdisk -ul /dev/hda
Disk /dev/hda: 81.9 GB, 81964302336 bytes
255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 63 4209029 2104483+ 83 Linux
/dev/hda2 4209030 37768814 16779892+ 86 NTFS
/dev/hda3 37768815 46170809 4200997+ 83 Linux
And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
33559785 sectors.
For Win2k and later dynamic disks, you can for example use the ldminfo utility
which is part of the Linux LDM tools (the latest version at the time of
writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
http://linux-ntfs.sourceforge.net/downloads.html
Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
able to compile this yourself easily so use the binary version!
Then you would use ldminfo in dump mode to obtain the necessary information:
$ ./ldminfo --dump /dev/hda
This would dump the LDM database found on /dev/hda which describes all of your
dynamic disks and all the volumes on them. At the bottom you will see the
VOLUME DEFINITIONS section which is all you really need. You may need to look
further above to determine which of the disks in the volume definitions is
which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
section). You can then find these Disk Ids in the VBLK DATABASE section in the
<Disk> components where you will get the LDM Name for the disk that is found in
the VOLUME DEFINITIONS section.
Note you will also need to enable the LDM driver in the Linux kernel. If your
distribution did not enable it, you will need to recompile the kernel with it
enabled. This will create the LDM partitions on each device at boot time. You
would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
in the Device-Mapper table.
You can also bypass using the LDM driver by using the main device (e.g.
/dev/hda) and then using the offsets of the LDM partitions into this device as
the "Start sector of device" when creating the table. Once again ldminfo would
give you the correct information to do this.
Assuming you know all your devices and their sizes things are easy.
For a linear raid the table would look like this (note all values are in
512-byte sectors):
--- cut here ---
# Offset into Size of this Raid type Device Start sector
# volume device of device
0 1028161 linear /dev/hda1 0
1028161 3903762 linear /dev/hdb2 0
4931923 2103211 linear /dev/hdc1 0
--- cut here ---
For a striped volume, i.e. raid level 0, you will need to know the chunk size
you used when creating the volume. Windows uses 64kiB as the default, so it
will probably be this unless you changed the defaults when creating the array.
For a raid level 0 the table would look like this (note all values are in
512-byte sectors):
--- cut here ---
# Offset Size Raid Number Chunk 1st Start 2nd Start
# into of the type of size Device in Device in
# volume volume stripes device device
0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
--- cut here ---
If there are more than two devices, just add each of them to the end of the
line.
Finally, for a mirrored volume, i.e. raid level 1, the table would look like
this (note all values are in 512-byte sectors):
--- cut here ---
# Ofs Size Raid Log Number Region Should Number Source Start Target Start
# in of the type type of log size sync? of Device in Device in
# vol volume params mirrors Device Device
0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
--- cut here ---
If you are mirroring to multiple devices you can specify further targets at the
end of the line.
Note the "Should sync?" parameter "nosync" means that the two mirrors are
already in sync which will be the case on a clean shutdown of Windows. If the
mirrors are not clean, you can specify the "sync" option instead of "nosync"
and the Device-Mapper driver will then copy the entirety of the "Source Device"
to the "Target Device" or, if you specified multiple target devices, to all of
them.
Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
and hand it over to dmsetup to work with, like so:
$ dmsetup create myvolume1 /etc/ntfsvolume1
You can obviously replace "myvolume1" with whatever name you like.
If it all worked, you will now have the device /dev/device-mapper/myvolume1
which you can then just use as an argument to the mount command as usual to
mount the ntfs volume. For example:
$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
(You need to create the directory /mnt/myvol1 first and of course you can use
anything you like instead of /mnt/myvol1 as long as it is an existing
directory.)
It is advisable to do the mount read-only to see if the volume has been set up
correctly to avoid the possibility of causing damage to the data on the ntfs
volume.
The Software RAID / MD driver
-----------------------------
An alternative to using the Device-Mapper driver is to use the kernel's
Software RAID / MD driver, for which you need to set up your /etc/raidtab
appropriately (see man 5 raidtab).
Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
0, have been tested and work fine (though see section "Limitations when using
the Software RAID / MD driver", especially if you want to use linear raid).
Even though untested, there is no reason why mirrors, i.e. raid level 1, and
stripes with parity, i.e. raid level 5, should not work, too.
You have to use the "persistent-superblock 0" option for each raid-disk in the
NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
superblock used by the MD driver would damage the NTFS volume.
Windows by default uses a stripe chunk size of 64k, so you probably want the
"chunk-size 64k" option for each raid-disk, too.
For example, if you have a stripe set consisting of two partitions /dev/hda5
and /dev/hdb1 your /etc/raidtab would look like this:
raiddev /dev/md0
raid-level 0
nr-raid-disks 2
nr-spare-disks 0
persistent-superblock 0
chunk-size 64k
device /dev/hda5
raid-disk 0
device /dev/hdb1
raid-disk 1
For linear raid, just change the raid-level above to "raid-level linear", for
mirrors, change it to "raid-level 1", and for stripe sets with parity, change
it to "raid-level 5".
Note for stripe sets with parity you will also need to tell the MD driver
which parity algorithm to use by specifying the option "parity-algorithm
which", where you need to replace "which" with the name of the algorithm to
use (see man 5 raidtab for available algorithms) and you will have to try the
different available algorithms until you find one that works. Make sure you
are working read-only when playing with this as you may damage your data
otherwise. If you find which algorithm works please let us know (email the
linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
IRC in channel #ntfs on the irc.freenode.net network) so we can update this
documentation.
Once the raidtab is set up, run for example raid0run -a to start all devices or
raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
Then just use the mount command as usual to mount the ntfs volume using for
example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
It is advisable to do the mount read-only to see if the md volume has been
set up correctly to avoid the possibility of causing damage to the data on the
ntfs volume.
Limitations when using the Software RAID / MD driver
-----------------------------------------------------
Using the md driver will not work properly if any of your NTFS partitions have
an odd number of sectors. This is especially important for linear raid as all
data after the first partition with an odd number of sectors will be offset by
one or more sectors so if you mount such a partition with write support you
will cause massive damage to the data on the volume which will only become
apparent when you try to use the volume again under Windows.
So when using linear raid, make sure that all your partitions have an even
number of sectors BEFORE attempting to use it. You have been warned!
Even better is to simply use the Device-Mapper for linear raid and then you do
not have this problem with odd numbers of sectors.
ChangeLog
=========
Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
2.1.22:
- Improve handling of ntfs volumes with errors.
- Fix various bugs and race conditions.
2.1.21:
- Fix several race conditions and various other bugs.
- Many internal cleanups, code reorganization, optimizations, and mft
and index record writing code rewritten to fit in with the changes.
- Update Documentation/filesystems/ntfs.txt with instructions on how to
use the Device-Mapper driver with NTFS ftdisk/LDM raid.
2.1.20:
- Fix two stupid bugs introduced in 2.1.18 release.
2.1.19:
- Minor bugfix in handling of the default upcase table.
- Many internal cleanups and improvements. Many thanks to Linus
Torvalds and Al Viro for the help and advice with the sparse
annotations and cleanups.
2.1.18:
- Fix scheduling latencies at mount time. (Ingo Molnar)
- Fix endianness bug in a little traversed portion of the attribute
lookup code.
2.1.17:
- Fix bugs in mount time error code paths.
2.1.16:
- Implement access time updates (including mtime and ctime).
- Implement fsync(2), fdatasync(2), and msync(2) system calls.
- Enable the readv(2) and writev(2) system calls.
- Enable access via the asynchronous io (aio) API by adding support for
the aio_read(3) and aio_write(3) functions.
2.1.15:
- Invalidate quotas when (re)mounting read-write.
NOTE: This now only leaves user space journalling on the side. (See
note for version 2.1.13, below.)
2.1.14:
- Fix an NFSd caused deadlock reported by several users.
2.1.13:
- Implement writing of inodes (access time updates are not implemented
yet so mounting with -o noatime,nodiratime is enforced).
- Enable writing out of resident files so you can now overwrite any
uncompressed, unencrypted, nonsparse file as long as you do not
change the file size.
- Add housekeeping of ntfs system files so that ntfsfix no longer needs
to be run after writing to an NTFS volume.
NOTE: This still leaves quota tracking and user space journalling on
the side but they should not cause data corruption. In the worst
case the charged quotas will be out of date ($Quota) and some
userspace applications might get confused due to the out of date
userspace journal ($UsnJrnl).
2.1.12:
- Fix the second fix to the decompression engine from the 2.1.9 release
and some further internals cleanups.
2.1.11:
- Driver internal cleanups.
2.1.10:
- Force read-only (re)mounting of volumes with unsupported volume
flags and various cleanups.
2.1.9:
- Fix two bugs in handling of corner cases in the decompression engine.
2.1.8:
- Read the $MFT mirror and compare it to the $MFT and if the two do not
match, force a read-only mount and do not allow read-write remounts.
- Read and parse the $LogFile journal and if it indicates that the
volume was not shutdown cleanly, force a read-only mount and do not
allow read-write remounts. If the $LogFile indicates a clean
shutdown and a read-write (re)mount is requested, empty $LogFile to
ensure that Windows cannot cause data corruption by replaying a stale
journal after Linux has written to the volume.
- Improve time handling so that the NTFS time is fully preserved when
converted to kernel time and only up to 99 nano-seconds are lost when
kernel time is converted to NTFS time.
2.1.7:
- Enable NFS exporting of mounted NTFS volumes.
2.1.6:
- Fix minor bug in handling of compressed directories that fixes the
erroneous "du" and "stat" output people reported.
2.1.5:
- Minor bug fix in attribute list attribute handling that fixes the
I/O errors on "ls" of certain fragmented files found by at least two
people running Windows XP.
2.1.4:
- Minor update allowing compilation with all gcc versions (well, the
ones the kernel can be compiled with anyway).
2.1.3:
- Major bug fixes for reading files and volumes in corner cases which
were being hit by Windows 2k/XP users.
2.1.2:
- Major bug fixes alleviating the hangs in statfs experienced by some
users.
2.1.1:
- Update handling of compressed files so people no longer get the
frequently reported warning messages about initialized_size !=
data_size.
2.1.0:
- Add configuration option for developmental write support.
- Initial implementation of file overwriting. (Writes to resident files
are not written out to disk yet, so avoid writing to files smaller
than about 1kiB.)
- Intercept/abort changes in file size as they are not implemented yet.
2.0.25:
- Minor bugfixes in error code paths and small cleanups.
2.0.24:
- Small internal cleanups.
- Support for sendfile system call. (Christoph Hellwig)
2.0.23:
- Massive internal locking changes to mft record locking. Fixes
various race conditions and deadlocks.
- Fix ntfs over loopback for compressed files by adding an
optimization barrier. (gcc was screwing up otherwise ?)
Thanks go to Christoph Hellwig for pointing these two out:
- Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs().
- Fix ntfs_free() for ia64 and parisc.
2.0.22:
- Small internal cleanups.
2.0.21:
These only affect 32-bit architectures:
- Check for, and refuse to mount too large volumes (maximum is 2TiB).
- Check for, and refuse to open too large files and directories
(maximum is 16TiB).
2.0.20:
- Support non-resident directory index bitmaps. This means we now cope
with huge directories without problems.
- Fix a page leak that manifested itself in some cases when reading
directory contents.
- Internal cleanups.
2.0.19:
- Fix race condition and improvements in block i/o interface.
- Optimization when reading compressed files.
2.0.18:
- Fix race condition in reading of compressed files.
2.0.17:
- Cleanups and optimizations.
2.0.16:
- Fix stupid bug introduced in 2.0.15 in new attribute inode API.
- Big internal cleanup replacing the mftbmp access hacks by using the
new attribute inode API instead.
2.0.15:
- Bug fix in parsing of remount options.
- Internal changes implementing attribute (fake) inodes allowing all
attribute i/o to go via the page cache and to use all the normal
vfs/mm functionality.
2.0.14:
- Internal changes improving run list merging code and minor locking
change to not rely on BKL in ntfs_statfs().
2.0.13:
- Internal changes towards using iget5_locked() in preparation for
fake inodes and small cleanups to ntfs_volume structure.
2.0.12:
- Internal cleanups in address space operations made possible by the
changes introduced in the previous release.
2.0.11:
- Internal updates and cleanups introducing the first step towards
fake inode based attribute i/o.
2.0.10:
- Microsoft says that the maximum number of inodes is 2^32 - 1. Update
the driver accordingly to only use 32-bits to store inode numbers on
32-bit architectures. This improves the speed of the driver a little.
2.0.9:
- Change decompression engine to use a single buffer. This should not
affect performance except perhaps on the most heavy i/o on SMP
systems when accessing multiple compressed files from multiple
devices simultaneously.
- Minor updates and cleanups.
2.0.8:
- Remove now obsolete show_inodes and posix mount option(s).
- Restore show_sys_files mount option.
- Add new mount option case_sensitive, to determine if the driver
treats file names as case sensitive or not.
- Mostly drop support for short file names (for backwards compatibility
we only support accessing files via their short file name if one
exists).
- Fix dcache aliasing issues wrt short/long file names.
- Cleanups and minor fixes.
2.0.7:
- Just cleanups.
2.0.6:
- Major bugfix to make compatible with other kernel changes. This fixes
the hangs/oopses on umount.
- Locking cleanup in directory operations (remove BKL usage).
2.0.5:
- Major buffer overflow bug fix.
- Minor cleanups and updates for kernel 2.5.12.
2.0.4:
- Cleanups and updates for kernel 2.5.11.
2.0.3:
- Small bug fixes, cleanups, and performance improvements.
2.0.2:
- Use default fmask of 0177 so that files are not executable by default.
If you want owner executable files, just use fmask=0077.
- Update for kernel 2.5.9 but preserve backwards compatibility with
kernel 2.5.7.
- Minor bug fixes, cleanups, and updates.
2.0.1:
- Minor updates, primarily set the executable bit by default on files
so they can be executed.
2.0.0:
- Started ChangeLog.


@@ -0,0 +1,266 @@
Changes since 2.5.0:
---
[recommended]
New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
sb_set_blocksize() and sb_min_blocksize().
Use them.
(sb_find_get_block() replaces 2.4's get_hash_table())
---
[recommended]
New methods: ->alloc_inode() and ->destroy_inode().
Remove inode->u.foo_inode_i
Declare
struct foo_inode_info {
/* fs-private stuff */
struct inode vfs_inode;
};
static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
return list_entry(inode, struct foo_inode_info, vfs_inode);
}
Use FOO_I(inode) instead of &inode->u.foo_inode_i;
Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
foo_inode_info and return the address of ->vfs_inode, the latter should free
FOO_I(inode) (see in-tree filesystems for examples).
Make them ->alloc_inode and ->destroy_inode in your super_operations.
Keep in mind that now you need explicit initialization of private data -
typically in ->read_inode() and after getting an inode from new_inode().
At some point that will become mandatory.
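A minimal sketch of the two methods (illustrative only; the foo_* names are
made up and in-tree filesystems normally allocate from a dedicated slab cache
rather than with kmalloc):
	static struct inode *foo_alloc_inode(struct super_block *sb)
	{
		struct foo_inode_info *fi;

		fi = kmalloc(sizeof(struct foo_inode_info), GFP_KERNEL);
		if (!fi)
			return NULL;
		/* private fields get their real values in ->read_inode() etc. */
		return &fi->vfs_inode;
	}

	static void foo_destroy_inode(struct inode *inode)
	{
		kfree(FOO_I(inode));
	}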
---
[mandatory]
Change of file_system_type method (->read_super to ->get_sb)
->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
Turn your foo_read_super() into a function that would return 0 in case of
success and negative number in case of error (-EINVAL unless you have more
informative error value to report). Call it foo_fill_super(). Now declare
struct super_block *foo_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super);
}
(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
filesystem).
Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
foo_get_sb.
---
[mandatory]
Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
Most likely there is no need to change anything, but if you relied on
global exclusion between renames for some internal purpose - you need to
change your internal locking. Otherwise exclusion guarantees remain the
same (i.e. parents and victim are locked, etc.).
---
[informational]
Now we have the exclusion between ->lookup() and directory removal (by
->rmdir() and ->rename()). If you used to need that exclusion and do
it by internal locking (most of filesystems couldn't care less) - you
can relax your locking.
---
[mandatory]
->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
and ->readdir() are called without BKL now. Grab it on entry, drop upon return
- that will guarantee the same locking you used to have. If your method or its
parts do not need BKL - better yet, now you can shift lock_kernel() and
unlock_kernel() so that they would protect exactly what needs to be
protected.
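A sketch of that transitional pattern (names are made up; foo_do_unlink()
stands for the pre-existing code that still assumes the BKL):
	static int foo_unlink(struct inode *dir, struct dentry *dentry)
	{
		int err;

		lock_kernel();		/* keep the exclusion the method used to get */
		err = foo_do_unlink(dir, dentry);
		unlock_kernel();
		return err;
	}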
---
[mandatory]
BKL is also moved from around sb operations. ->write_super() is now called
without BKL held. BKL should have been shifted into individual fs sb_op
functions. If you don't need it, remove it.
---
[informational]
check for ->link() target not being a directory is done by callers. Feel
free to drop it...
---
[informational]
->link() callers hold ->i_sem on the object we are linking to. Some of your
problems might be over...
---
[mandatory]
new file_system_type method - kill_sb(superblock). If you are converting
an existing filesystem, set it according to ->fs_flags:
FS_REQUIRES_DEV - kill_block_super
FS_LITTER - kill_litter_super
neither - kill_anon_super
FS_LITTER is gone - just remove it from fs_flags.
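For a typical block-device based filesystem the result might look like this
sketch (the foo_* names are illustrative only):
	static struct file_system_type foo_fs_type = {
		.owner		= THIS_MODULE,
		.name		= "foo",
		.get_sb		= foo_get_sb,
		.kill_sb	= kill_block_super,
		.fs_flags	= FS_REQUIRES_DEV,
	};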
---
[mandatory]
FS_SINGLE is gone (actually, that had happened back when ->get_sb()
went in - and hadn't been documented ;-/). Just remove it from fs_flags
(and see ->get_sb() entry for other actions).
---
[mandatory]
->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so
watch for ->i_sem-grabbing code that might be used by your ->setattr().
Callers of notify_change() need ->i_sem now.
---
[recommended]
New super_block field "struct export_operations *s_export_op" for
explicit support for exporting, e.g. via NFS. The structure is fully
documented at its declaration in include/linux/fs.h, and in
Documentation/filesystems/Exporting.
Briefly it allows for the definition of decode_fh and encode_fh operations
to encode and decode filehandles, and allows the filesystem to use
a standard helper function for decode_fh, and provide file-system specific
support for this helper, particularly get_parent.
It is planned that this will be required for exporting once the code
settles down a bit.
[mandatory]
s_export_op is now required for exporting a filesystem.
isofs, ext2, ext3, reiserfs, fat
can be used as examples of very different filesystems.
---
[mandatory]
iget4() and the read_inode2 callback have been superseded by iget5_locked()
which has the following prototype,
struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *),
void *data);
'test' is an additional function that can be used when the inode
number is not sufficient to identify the actual file object. 'set'
should be a non-blocking function that initializes those parts of a
newly created inode to allow the test function to succeed. 'data' is
passed as an opaque value to both test and set functions.
When the inode has been created by iget5_locked(), it will be returned with
the I_NEW flag set and will still be locked. read_inode has not been
called so the file system still has to finalize the initialization. Once
the inode is initialized it must be unlocked by calling unlock_new_inode().
The filesystem is responsible for setting (and possibly testing) i_ino
when appropriate. There is also a simpler iget_locked function that
just takes the superblock and inode number as arguments and does the
test and set for you.
e.g.
inode = iget_locked(sb, ino);
if (inode->i_state & I_NEW) {
read_inode_from_disk(inode);
unlock_new_inode(inode);
}
---
[recommended]
->getattr() finally getting used. See instances in nfs, minix, etc.
---
[mandatory]
->revalidate() is gone. If your filesystem had it - provide ->getattr()
and let it call whatever you had as ->revalidate() + (for symlinks that
had ->revalidate()) add calls in ->follow_link()/->readlink().
---
[mandatory]
->d_parent changes are not protected by BKL anymore. Read access is safe
if at least one of the following is true:
* filesystem has no cross-directory rename()
* dcache_lock is held
* we know that parent had been locked (e.g. we are looking at
->d_parent of ->lookup() argument).
* we are called from ->rename().
* the child's ->d_lock is held
Audit your code and add locking if needed. Notice that any place that is
not protected by the conditions above is risky even in the old tree - you
had been relying on BKL and that's prone to screwups. Old tree had quite
a few holes of that kind - unprotected access to ->d_parent leading to
anything from oops to silent memory corruption.
---
[mandatory]
FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
(see rootfs for one kind of solution and bdev/socket/pipe for another).
---
[recommended]
Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
is still alive, but only because of the mess in drivers/s390/block/dasd.c.
As soon as it gets fixed is_read_only() will die.
---
[mandatory]
->permission() is called without BKL now. Grab it on entry, drop upon
return - that will guarantee the same locking you used to have. If
your method or its parts do not need BKL - better yet, now you can
shift lock_kernel() and unlock_kernel() so that they would protect
exactly what needs to be protected.
---
[mandatory]
->statfs() is now called without BKL held. BKL should have been
shifted into individual fs sb_op functions where it's not clear that
it's safe to remove it. If you don't need it, remove it.
---
[mandatory]
is_read_only() is gone; use bdev_read_only() instead.
---
[mandatory]
destroy_buffers() is gone; use invalidate_bdev().
---
[mandatory]
fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
deliberate; as soon as struct block_device * is propagated in a reasonable
way by that code fixing will become trivial; until then nothing can be
done.

File diff suppressed because it is too large.


@@ -0,0 +1,187 @@
ROMFS - ROM FILE SYSTEM
This is a quite dumb, read only filesystem, mainly for initial RAM
disks of installation disks. It grew out of the need to have modules
linked at boot time. Using this filesystem, you get a very
similar feature, and even the possibility of a small kernel, with a
file system which doesn't take up useful memory from the router
functions in the basement of your office.
For comparison, both the older minix and xiafs (the latter is now
defunct) filesystems, compiled as modules, need more than 20000 bytes,
while romfs is less than a page, about 4000 bytes (assuming i586
code). Under the same conditions, the msdos filesystem would need
about 30K (and does not support device nodes or symlinks), while the
nfs module with nfsroot is about 57K. Furthermore, as a bit unfair
comparison, an actual rescue disk used up 3202 blocks with ext2, while
with romfs, it needed 3079 blocks.
To create such a file system, you'll need a user program named
genromfs. It is available via anonymous ftp on sunsite.unc.edu and
its mirrors, in the /pub/Linux/system/recovery/ directory.
As the name suggests, romfs could be also used (space-efficiently) on
various read-only media, like (E)EPROM disks, if someone has the
motivation... :)
However, the main purpose of romfs is to have a very small kernel,
which has only this filesystem linked in, and then can load any module
later, with the current module utilities. It can also be used to run
some program to decide if you need SCSI devices, and even IDE or
floppy drives can be loaded later if you use the "initrd"--initial
RAM disk--feature of the kernel. This is not really a news flash, but
with romfs you can even leave out your ext2 or minix or maybe even affs
filesystem until you really know that you need it.
For example, a distribution boot disk can contain only the cd disk
drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
module. The kernel can be small enough, since it doesn't have other
filesystems, like the quite large ext2fs module, which can then be
loaded off the CD at a later stage of the installation. Another use
would be for a recovery disk, when you are reinstalling a workstation
from the network, and you will have all the tools/modules available
from a nearby server, so you don't want to carry two disks for this
purpose, just because it won't fit into ext2.
romfs operates on block devices as you can expect, and the underlying
structure is very simple. Every accessible structure begins on 16
byte boundaries for fast access. The minimum space a file will take
is 32 bytes (this is an empty file, with a less than 16 character
name). The maximum overhead for any non-empty file is the header, and
the 16 byte padding for the name and the contents, that is 16+14+15 = 45
bytes. This is quite rare however, since most file names are longer
than 3 bytes, and shorter than 15 bytes.
The layout of the filesystem is the following:
offset content
+---+---+---+---+
0 | - | r | o | m | \
+---+---+---+---+ The ASCII representation of those bytes
4 | 1 | f | s | - | / (i.e. "-rom1fs-")
+---+---+---+---+
8 | full size | The number of accessible bytes in this fs.
+---+---+---+---+
12 | checksum | The checksum of the FIRST 512 BYTES.
+---+---+---+---+
16 | volume name | The zero terminated name of the volume,
: : padded to 16 byte boundary.
+---+---+---+---+
xx | file |
: headers :
Every multi byte value (32 bit words, I'll use the longwords term from
now on) must be in big endian order.
The first eight bytes identify the filesystem, even for the casual
inspector. After that, in the 3rd longword, it contains the number of
bytes accessible from the start of this filesystem. The 4th longword
is the checksum of the first 512 bytes (or the number of bytes
accessible, whichever is smaller). The applied algorithm is the same
as in the AFFS filesystem, namely a simple sum of the longwords
(assuming bigendian quantities again). For details, please consult
the source. This algorithm was chosen because although it's not quite
reliable, it does not require any tables, and it is very simple.
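Expressed as a C structure, purely for illustration (the field names below
are made up rather than taken from the kernel sources, and every multi-byte
field is stored big endian on disk):
	struct romfs_superblock {
		char		magic[8];	/* "-rom1fs-" */
		uint32_t	full_size;	/* accessible bytes in this fs */
		uint32_t	checksum;	/* longword sum of the first 512 bytes */
		char		volume_name[];	/* zero terminated, 16 byte padded */
	};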
The following bytes are now part of the file system; each file header
must begin on a 16 byte boundary.
offset content
+---+---+---+---+
0 | next filehdr|X| The offset of the next file header
+---+---+---+---+ (zero if no more files)
4 | spec.info | Info for directories/hard links/devices
+---+---+---+---+
8 | size | The size of this file in bytes
+---+---+---+---+
12 | checksum | Covering the meta data, including the file
+---+---+---+---+ name, and padding
16 | file name | The zero terminated name of the file,
: : padded to 16 byte boundary
+---+---+---+---+
xx | file data |
: :
Since the file headers begin always at a 16 byte boundary, the lowest
4 bits would be always zero in the next filehdr pointer. These four
bits are used for the mode information. Bits 0..2 specify the type of
the file, while bit 3 shows whether the file is executable. The
permissions are assumed to be world readable, if this bit is not set,
and world executable if it is; except the character and block devices,
they are never accessible for other than owner. The owner of every
file is user and group 0, this should never be a problem for the
intended use. The mapping of the 8 possible values to file types is
the following:
mapping spec.info means
0 hard link link destination [file header]
1 directory first file's header
2 regular file unused, must be zero [MBZ]
3 symbolic link unused, MBZ (file data is the link content)
4 block device 16/16 bits major/minor number
5 char device - " -
6 socket unused, MBZ
7 fifo unused, MBZ
Note that hard links are specifically marked in this filesystem, but
they will behave as you can expect (i.e. share the inode number).
Note also that it is your responsibility not to create hard link
loops, and to create all the . and .. links for directories. This is
normally done correctly by the genromfs program. Please refrain from
using the executable bits for special purposes on the socket and fifo
special files, they may have other uses in the future. Additionally,
please remember that only regular files, and symlinks are supposed to
have a nonzero size field; they contain the number of bytes available
directly after the (padded) file name.
Another thing to note is that romfs works on file headers and data
aligned to 16 byte boundaries, but most hardware devices and the block
device drivers are unable to cope with smaller than block-sized data.
To overcome this limitation, the whole size of the file system must be
padded to a 1024 byte boundary.
If you have any problems or suggestions concerning this file system,
please contact me. However, think twice before wanting me to add
features and code, because the primary and most important advantage of
this file system is the small code. On the other hand, don't be
alarmed, I'm not getting that much romfs related mail. Now I can
understand why Avery wrote poems in the ARCnet docs to get some more
feedback. :)
romfs also has a mailing list, and to date, it hasn't received any
traffic, so you are welcome to join it to discuss your ideas. :)
It's run by ezmlm, so you can subscribe to it by sending a message
to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
Pending issues:
- Permissions and owner information are pretty essential features of a
Un*x like system, but romfs does not provide the full possibilities.
I have never found this limiting, but others might.
- The file system is read only, so it can be very small, but in case
one would want to write _anything_ to a file system, he still needs
a writable file system, thus negating the size advantages. Possible
solutions: implement write access as a compile-time option, or a new,
similarly small writable filesystem for RAM disks.
- Since the files are only required to have alignment on a 16 byte
boundary, it is currently possibly suboptimal to read or execute files
from the filesystem. It might be resolved by reordering file data to
have most of it (i.e. except the start and the end) lying at "natural"
boundaries, thus it would be possible to directly map a big portion of
the file contents to the mm subsystem.
- Compression might be a useful feature, but memory is quite a
limiting factor in my eyes.
- Where is it used?
- Does it work on other architectures than intel and motorola?
Have fun,
Janos Farkas <chexum@shadow.banki.hu>


@@ -0,0 +1,8 @@
Smbfs is a filesystem that implements the SMB protocol, which is the
protocol used by Windows for Workgroups, Windows 95 and Windows NT.
Smbfs was inspired by Samba, the program written by Andrew Tridgell
that turns any Unix host into a file server for DOS or Windows clients.
Smbfs is an SMB client, but uses parts of Samba for its operation. For
more info on samba, including documentation, please go to
http://www.samba.org/ and then on to your nearest mirror.


@@ -0,0 +1,88 @@
Accessing PCI device resources through sysfs
sysfs, usually mounted at /sys, provides access to PCI resources on platforms
that support it. For example, a given bus might look like this:
/sys/devices/pci0000:17
|-- 0000:17:00.0
| |-- class
| |-- config
| |-- detach_state
| |-- device
| |-- irq
| |-- local_cpus
| |-- resource
| |-- resource0
| |-- resource1
| |-- resource2
| |-- rom
| |-- subsystem_device
| |-- subsystem_vendor
| `-- vendor
`-- detach_state
The topmost element describes the PCI domain and bus number. In this case,
the domain number is 0000 and the bus number is 17 (both values are in hex).
This bus contains a single function device in slot 0. The domain and bus
numbers are reproduced for convenience. Under the device directory are several
files, each with its own function.
file function
---- --------
class PCI class (ascii, ro)
config PCI config space (binary, rw)
detach_state connection status (bool, rw)
device PCI device (ascii, ro)
irq IRQ number (ascii, ro)
local_cpus nearby CPU mask (cpumask, ro)
resource PCI resource host addresses (ascii, ro)
resource0..N PCI resource N, if present (binary, mmap)
rom PCI ROM resource, if present (binary, ro)
subsystem_device PCI subsystem device (ascii, ro)
subsystem_vendor PCI subsystem vendor (ascii, ro)
vendor PCI vendor (ascii, ro)
ro - read only file
rw - file is readable and writable
mmap - file is mmapable
ascii - file contains ascii text
binary - file contains binary data
cpumask - file contains a cpumask type
The read only files are informational; writes to them will be ignored.
Writable files can be used to perform actions on the device (e.g. changing
config space, detaching a device). mmapable files are available via an
mmap of the file at offset 0 and can be used to do actual device programming
from userspace. Note that some platforms don't support mmapping of certain
resources, so be sure to check the return value from any attempted mmap.
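A rough userspace sketch of mapping a resource file (the sysfs path and the
mapping length are placeholders; a real program would take the size from the
'resource' file and check every call):
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/sys/devices/pci0000:17/0000:17:00.0/resource0", O_RDWR);
		void *bar;

		if (fd < 0)
			return 1;
		/* resource files are mapped starting at offset 0 */
		bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (bar == MAP_FAILED) {	/* platform may not support this */
			perror("mmap");
			close(fd);
			return 1;
		}
		/* ... program the device through the mapped registers ... */
		munmap(bar, 4096);
		close(fd);
		return 0;
	}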
Accessing legacy resources through sysfs
Legacy I/O port and ISA memory resources are also provided in sysfs if the
underlying platform supports them. They're located in the PCI class hierarchy,
e.g.
/sys/class/pci_bus/0000:17/
|-- bridge -> ../../../devices/pci0000:17
|-- cpuaffinity
|-- legacy_io
`-- legacy_mem
The legacy_io file is a read/write file that can be used by applications to
do legacy port I/O. The application should open the file, seek to the desired
port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
file should be mmapped with an offset corresponding to the memory offset
desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
simply dereference the returned pointer (after checking for errors of course)
to access legacy memory space.
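For example, reading one byte from port 0x3e8 could look roughly like this
(the bus path is just an example and error checking is omitted for brevity):
	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	int read_port_byte(void)
	{
		uint8_t val;
		int fd = open("/sys/class/pci_bus/0000:17/legacy_io", O_RDWR);

		lseek(fd, 0x3e8, SEEK_SET);	/* seek to the port number */
		read(fd, &val, 1);		/* only 1, 2 or 4 byte accesses */
		close(fd);
		return val;
	}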
Supporting PCI access on new platforms
In order to support PCI resource mapping as described above, Linux platform
code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
Platforms are free to only support subsets of the mmap functionality, but
useful return codes should be provided.
Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
wishing to support legacy functionality should define it and provide
pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.


@@ -0,0 +1,341 @@
sysfs - _The_ filesystem for exporting kernel objects.
Patrick Mochel <mochel@osdl.org>
10 January 2003
What it is:
~~~~~~~~~~~
sysfs is a ram-based filesystem initially based on ramfs. It provides
a means to export kernel data structures, their attributes, and the
linkages between them to userspace.
sysfs is tied inherently to the kobject infrastructure. Please read
Documentation/kobject.txt for more information concerning the kobject
interface.
Using sysfs
~~~~~~~~~~~
sysfs is always compiled in. You can access it by doing:
mount -t sysfs sysfs /sys
Directory Creation
~~~~~~~~~~~~~~~~~~
For every kobject that is registered with the system, a directory is
created for it in sysfs. That directory is created as a subdirectory
of the kobject's parent, expressing internal object hierarchies to
userspace. Top-level directories in sysfs represent the common
ancestors of object hierarchies; i.e. the subsystems the objects
belong to.
Sysfs internally stores the kobject that owns the directory in the
->d_fsdata pointer of the directory's dentry. This allows sysfs to do
reference counting directly on the kobject when the file is opened and
closed.
Attributes
~~~~~~~~~~
Attributes can be exported for kobjects in the form of regular files in
the filesystem. Sysfs forwards file I/O operations to methods defined
for the attributes, providing a means to read and write kernel
attributes.
Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon. Doing these things may get
you publicly humiliated and your code rewritten without notice.
An attribute definition is simply:
struct attribute {
char * name;
mode_t mode;
};
int sysfs_create_file(struct kobject * kobj, struct attribute * attr);
void sysfs_remove_file(struct kobject * kobj, struct attribute * attr);
A bare attribute contains no means to read or write the value of the
attribute. Subsystems are encouraged to define their own attribute
structure and wrapper functions for adding and removing attributes for
a specific object type.
For example, the driver model defines struct device_attribute like:
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device * dev, char * buf);
ssize_t (*store)(struct device * dev, const char * buf);
};
int device_create_file(struct device *, struct device_attribute *);
void device_remove_file(struct device *, struct device_attribute *);
It also defines this helper for defining device attributes:
#define DEVICE_ATTR(_name,_mode,_show,_store) \
struct device_attribute dev_attr_##_name = { \
.attr = {.name = __stringify(_name) , .mode = _mode }, \
.show = _show, \
.store = _store, \
};
For example, declaring
static DEVICE_ATTR(foo,0644,show_foo,store_foo);
is equivalent to doing:
static struct device_attribute dev_attr_foo = {
.attr = {
.name = "foo",
.mode = 0644,
},
.show = show_foo,
.store = store_foo,
};
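Hooking the attribute up is then just a matter of calling the helpers quoted
earlier; roughly (where "dev" stands for whatever struct device pointer the
driver already has):
	/* typically in the driver's probe/setup path: */
	int error = device_create_file(dev, &dev_attr_foo);
	if (error)
		return error;

	/* and on teardown: */
	device_remove_file(dev, &dev_attr_foo);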
Subsystem-Specific Callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a subsystem defines a new attribute type, it must implement a
set of sysfs operations for forwarding read and write calls to the
show and store methods of the attribute owners.
struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *,char *);
ssize_t (*store)(struct kobject *,struct attribute *,const char *);
};
[ Subsystems should have already defined a struct kobj_type as a
descriptor for this type, which is where the sysfs_ops pointer is
stored. See the kobject documentation for more information. ]
When a file is read or written, sysfs calls the appropriate method
for the type. The method then translates the generic struct kobject
and struct attribute pointers to the appropriate pointer types, and
calls the associated methods.
To illustrate:
#define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr)
#define to_dev(d) container_of(d, struct device, kobj)
static ssize_t
dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
{
struct device_attribute * dev_attr = to_dev_attr(attr);
struct device * dev = to_dev(kobj);
ssize_t ret = 0;
if (dev_attr->show)
ret = dev_attr->show(dev,buf);
return ret;
}
Reading/Writing Attribute Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To read or write attributes, show() or store() methods must be
specified when declaring the attribute. The method types should be as
simple as those defined for device attributes:
ssize_t (*show)(struct device * dev, char * buf);
ssize_t (*store)(struct device * dev, const char * buf);
IOW, they should take only an object and a buffer as parameters.
sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
method. Sysfs will call the method exactly once for each read or
write. This forces the following behavior on the method
implementations:
- On read(2), the show() method should fill the entire buffer.
Recall that an attribute should only be exporting one value, or an
array of similar values, so this shouldn't be that expensive.
This allows userspace to do partial reads and seeks arbitrarily over
the entire file at will.
- On write(2), sysfs expects the entire buffer to be passed during the
first write. Sysfs then passes the entire buffer to the store()
method.
When writing sysfs files, userspace processes should first read the
entire file, modify the values it wishes to change, then write the
entire buffer back.
Attribute method implementations should operate on an identical
buffer when reading and writing values.
Other notes:
- The buffer will always be PAGE_SIZE bytes in length. On i386, this
is 4096.
- show() methods should return the number of bytes printed into the
buffer. This is the return value of snprintf().
- show() should always use snprintf().
- store() should return the number of bytes used from the buffer. This
can be done using strlen().
- show() or store() can always return errors. If a bad value comes
through, be sure to return an error.
- The object passed to the methods will be pinned in memory via sysfs
referencing counting its embedded object. However, the physical
entity (e.g. device) the object represents may not be present. Be
sure to have a way to check this, if necessary.
A very simple (and naive) implementation of a device attribute is:
static ssize_t show_name(struct device * dev, char * buf)
{
return sprintf(buf,"%s\n",dev->name);
}
static ssize_t store_name(struct device * dev, const char * buf)
{
sscanf(buf,"%20s",dev->name);
return strlen(buf);
}
static DEVICE_ATTR(name,S_IRUGO,show_name,store_name);
(Note that the real implementation doesn't allow userspace to set the
name for a device.)
Top Level Directory Layout
~~~~~~~~~~~~~~~~~~~~~~~~~~
The sysfs directory arrangement exposes the relationship of kernel
data structures.
The top level sysfs directory looks like:
block/
bus/
class/
devices/
firmware/
net/
devices/ contains a filesystem representation of the device tree. It maps
directly to the internal kernel device tree, which is a hierarchy of
struct device.
bus/ contains flat directory layout of the various bus types in the
kernel. Each bus's directory contains two subdirectories:
devices/
drivers/
devices/ contains symlinks for each device discovered in the system
that point to the device's directory under root/.
drivers/ contains a directory for each device driver that is loaded
for devices on that particular bus (this assumes that drivers do not
span multiple bus types).
More information on driver-model specific features can be found in
Documentation/driver-model/.
TODO: Finish this section.
Current Interfaces
~~~~~~~~~~~~~~~~~~
The following interface layers currently exist in sysfs:
- devices (include/linux/device.h)
----------------------------------
Structure:
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device * dev, char * buf);
ssize_t (*store)(struct device * dev, const char * buf);
};
Declaring:
DEVICE_ATTR(_name,_mode,_show,_store);
Creation/Removal:
int device_create_file(struct device *device, struct device_attribute * attr);
void device_remove_file(struct device * dev, struct device_attribute * attr);
- bus drivers (include/linux/device.h)
--------------------------------------
Structure:
struct bus_attribute {
struct attribute attr;
ssize_t (*show)(struct bus_type *, char * buf);
ssize_t (*store)(struct bus_type *, const char * buf);
};
Declaring:
BUS_ATTR(_name,_mode,_show,_store)
Creation/Removal:
int bus_create_file(struct bus_type *, struct bus_attribute *);
void bus_remove_file(struct bus_type *, struct bus_attribute *);
- device drivers (include/linux/device.h)
-----------------------------------------
Structure:
struct driver_attribute {
struct attribute attr;
ssize_t (*show)(struct device_driver *, char * buf);
ssize_t (*store)(struct device_driver *, const char * buf);
};
Declaring:
DRIVER_ATTR(_name,_mode,_show,_store)
Creation/Removal:
int driver_create_file(struct device_driver *, struct driver_attribute *);
void driver_remove_file(struct device_driver *, struct driver_attribute *);


@@ -0,0 +1,38 @@
This is the implementation of the SystemV/Coherent filesystem for Linux.
It implements all of
- Xenix FS,
- SystemV/386 FS,
- Coherent FS.
This is version beta 4.
To install:
* Answer the 'System V and Coherent filesystem support' question with 'y'
when configuring the kernel.
* To mount a disk or a partition, use
mount [-r] -t sysv device mountpoint
The file system type names
-t sysv
-t xenix
-t coherent
may be used interchangeably, but the last two will eventually disappear.
Bugs in the present implementation:
- Coherent FS:
- The "free list interleave" n:m is currently ignored.
- Only file systems with no filesystem name and no pack name are recognized.
(See Coherent "man mkfs" for a description of these features.)
- SystemV Release 2 FS:
The superblock is only searched for in blocks 9, 15 and 18, which
correspond to the beginning of track 1 on floppy disks. There is no
support for this FS on hard disk yet.
Please report any bugs and suggestions to
Bruno Haible <haible@ma2s2.mathematik.uni-karlsruhe.de>
Pascal Haible <haible@izfm.uni-stuttgart.de>
Krzysztof G. Baranowski <kgb@manjak.knm.org.pl>
Bruno Haible
<haible@ma2s2.mathematik.uni-karlsruhe.de>

View File

@@ -0,0 +1,100 @@
Tmpfs is a file system which keeps all files in virtual memory.
Everything in tmpfs is temporary in the sense that no files will be
created on your hard drive. If you unmount a tmpfs instance,
everything stored therein is lost.
tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
unneeded pages out to swap space. It has maximum size limits which can
be adjusted on the fly via 'mount -o remount ...'
If you compare it to ramfs (which was the template to create tmpfs)
you gain swapping and limit checking. Another similar thing is the RAM
disk (/dev/ram*), which simulates a fixed size hard disk in physical
RAM, on which you have to create an ordinary filesystem. Ramdisks
cannot swap and cannot be resized.
Since tmpfs lives completely in the page cache and on swap, all tmpfs
pages currently in memory will show up as cached. They will not show
up as shared memory or anything like that. You can check the actual
RAM+swap use of a tmpfs instance with df(1) and du(1).
tmpfs has the following uses:
1) There is always a kernel internal mount which you will not see at
all. This is used for shared anonymous mappings and SYSV shared
memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
set, the user visible part of tmpfs is not built. But the internal
mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
POSIX shared memory (shm_open, shm_unlink). Adding the following
line to /etc/fstab should take care of this:
tmpfs /dev/shm tmpfs defaults 0 0
Remember to create the directory that you intend to mount tmpfs on
if necessary (/dev/shm is automagically created if you use devfs).
This mount is _not_ needed for SYSV shared memory. The internal
mount is used for that. (In the 2.3 kernel versions it was
necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
shared memory)
3) Some people (including me) find it very convenient to mount it
e.g. on /tmp and /var/tmp and have a big swap partition. And now
loop mounts of tmpfs files do work, so mkinitrd shipped by most
distributions should succeed with a tmpfs /tmp.
4) And probably a lot more I do not know about :-)
tmpfs has three mount options for sizing:
size: The limit of allocated bytes for this tmpfs instance. The
default is half of your physical RAM without swap. If you
oversize your tmpfs instances the machine will deadlock
since the OOM handler will not be able to free that memory.
nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
nr_inodes: The maximum number of inodes for this instance. The default
is half of the number of your physical RAM pages, or (on a
machine with highmem) the number of lowmem RAM pages,
whichever is the lower.
These parameters accept a suffix k, m or g for kilo, mega and giga and
can be changed on remount. The size parameter also accepts a suffix %
to limit this tmpfs instance to that percentage of your physical RAM:
the default, when neither size nor nr_blocks is specified, is size=50%.
If both nr_blocks (or size) and nr_inodes are set to 0, neither blocks
nor inodes will be limited in that instance. It is generally unwise to
mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
that instance in a system with many cpus making intensive use of it.
To specify the initial root directory you can use the following mount
options:
mode: The permissions as an octal number
uid: The user id
gid: The group id
These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you a tmpfs instance on /mytmpfs which can allocate 10GB of
RAM/swap in 10240 inodes and is only accessible by root.
Author:
Christoph Rohland <cr@sap.com>, 1.12.01
Updated:
Hugh Dickins <hugh@veritas.com>, 01 September 2004

View File

@@ -0,0 +1,57 @@
*
* Documentation/filesystems/udf.txt
*
UDF Filesystem version 0.9.8.1
If you encounter problems with reading UDF discs using this driver,
please report them to linux_udf@hpesjro.fc.hp.com, which is the
developer's list.
Write support requires a block driver which supports writing. The current
scsi and ide cdrom drivers do not support writing.
-------------------------------------------------------------------------------
The following mount options are supported:
gid= Set the default group.
umask= Set the default umask.
uid= Set the default user.
bs= Set the block size.
unhide Show otherwise hidden files.
undelete Show deleted files in lists.
adinicb Embed data in the inode (default)
noadinicb Don't embed data in the inode
shortad Use short allocation descriptors
longad Use long allocation descriptors (default)
nostrict Unset strict conformance
iocharset= Set the NLS character set
The remaining are for debugging and disaster recovery:
novrs Skip volume sequence recognition
The following expect an offset from 0.
session= Set the CDROM session (default= last session)
anchor= Override standard anchor location. (default= 256)
volume= Override the VolumeDesc location. (unused)
partition= Override the PartitionDesc location. (unused)
lastblock= Set the last block of the filesystem.
The following expect an offset from the partition root.
fileset= Override the fileset block location. (unused)
rootdir= Override the root directory location. (unused)
WARNING: overriding the rootdir to a non-directory may
yield highly unpredictable results.
-------------------------------------------------------------------------------
For the latest version and toolset see:
http://linux-udf.sourceforge.net/
Documentation on UDF and ECMA 167 is available FREE from:
http://www.osta.org/
http://www.ecma-international.org/
Ben Fennema <bfennema@falcon.csc.calpoly.edu>

View File

@@ -0,0 +1,61 @@
USING UFS
=========
mount -t ufs -o ufstype=type_of_ufs device dir
UFS OPTIONS
===========
ufstype=type_of_ufs
UFS is a file system widely used in different operating systems.
The problem is the differences among implementations. Features of
some implementations are undocumented, so it's hard to recognize the
type of ufs automatically. That's why the user must specify the type
of ufs manually with the mount option ufstype. Possible values are:
old old format of ufs
default value, supported as read-only
44bsd used in FreeBSD, NetBSD, OpenBSD
supported as read-write
ufs2 used in FreeBSD 5.x
supported as read-only
5xbsd synonym for ufs2
sun used in SunOS (Solaris)
supported as read-write
sunx86 used in SunOS for Intel (Solarisx86)
supported as read-write
hp used in HP-UX
supported as read-only
nextstep
used in NextStep
supported as read-only
nextstep-cd
used for NextStep CDROMs (block_size == 2048)
supported as read-only
openstep
used in OpenStep
supported as read-only
POSSIBLE PROBLEMS
=================
There is still a bug in the reallocation of fragments, in file
fs/ufs/balloc.c, line 364, but it seems to work with the current
buffer cache configuration.
BUG REPORTS
===========
You can send any ufs bug reports to daniel.pirkl@email.cz (do not send
partition table bug reports).

View File

@@ -0,0 +1,231 @@
USING VFAT
----------------------------------------------------------------------
To use the vfat filesystem, use the filesystem type 'vfat', e.g.
mount -t vfat /dev/fd0 /mnt
No special partition formatter is required. mkdosfs will work fine
if you want to format from within Linux.
VFAT MOUNT OPTIONS
----------------------------------------------------------------------
umask=### -- The permission mask (for files and directories, see umask(1)).
The default is the umask of current process.
dmask=### -- The permission mask for the directory.
The default is the umask of current process.
fmask=### -- The permission mask for files.
The default is the umask of current process.
codepage=### -- Sets the codepage number for converting to shortname
characters on FAT filesystem.
By default, FAT_DEFAULT_CODEPAGE setting is used.
iocharset=name -- Character set to use for converting between the
encoding used for user visible filenames and the 16 bit
Unicode characters. Long filenames are stored on disk
in Unicode format, but Unix for the most part doesn't
know how to deal with Unicode.
By default, FAT_DEFAULT_IOCHARSET setting is used.
There is also an option of doing UTF8 translations
with the utf8 option.
NOTE: "iocharset=utf8" is not recommended. If unsure,
you should consider the following option instead.
utf8=<bool> -- UTF8 is the filesystem safe version of Unicode that
is used by the console. It can be enabled for the
filesystem with this option. If 'uni_xlate' gets set,
UTF8 gets disabled.
uni_xlate=<bool> -- Translate unhandled Unicode characters to special
escaped sequences. This would let you backup and
restore filenames that are created with any Unicode
characters. Until Linux supports Unicode for real,
this gives you an alternative. Without this option,
a '?' is used when no translation is possible. The
escape character is ':' because it is otherwise
illegal on the vfat filesystem. The escape sequence
that gets used is ':' and the four digits of hexadecimal
unicode.
nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
end in '~1' or tilde followed by some number. If this
option is set, then if the filename is
"longfilename.txt" and "longfile.txt" does not
currently exist in the directory, 'longfile.txt' will
be the short alias instead of 'longfi~1.txt'.
quiet -- Stops printing certain warning messages.
check=s|r|n -- Case sensitivity checking setting.
s: strict, case sensitive
r: relaxed, case insensitive
n: normal, default setting, currently case insensitive
shortname=lower|win95|winnt|mixed
-- Shortname display/create setting.
lower: convert to lowercase for display,
emulate the Windows 95 rule for create.
win95: emulate the Windows 95 rule for display/create.
winnt: emulate the Windows NT rule for display/create.
mixed: emulate the Windows NT rule for display,
emulate the Windows 95 rule for create.
Default setting is `lower'.
<bool>: 0,1,yes,no,true,false
TODO
----------------------------------------------------------------------
* Need to get rid of the raw scanning stuff. Instead, always use
a get next directory entry approach. The only thing left that uses
raw scanning is the directory renaming code.
POSSIBLE PROBLEMS
----------------------------------------------------------------------
* vfat_valid_longname does not properly check reserved names.
* When a volume name is the same as a directory name in the root
directory of the filesystem, the directory name sometimes shows
up as an empty file.
* autoconv option does not work correctly.
BUG REPORTS
----------------------------------------------------------------------
If you have trouble with the VFAT filesystem, mail bug reports to
chaffee@bmrc.cs.berkeley.edu. Please specify the filename
and the operation that gave you trouble.
TEST SUITE
----------------------------------------------------------------------
If you plan to make any modifications to the vfat filesystem, please
get the test suite that comes with the vfat distribution at
http://bmrc.berkeley.edu/people/chaffee/vfat.html
This tests quite a few parts of the vfat filesystem and additional
tests for new features or untested features would be appreciated.
NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
----------------------------------------------------------------------
(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
and lightly annotated by Gordon Chaffee).
This document presents a very rough, technical overview of my
knowledge of the extended FAT file system used in Windows NT 3.5 and
Windows 95. I don't guarantee that any of the following is correct,
but it appears to be so.
The extended FAT file system is almost identical to the FAT
file system used in DOS versions up to and including 6.223410239847
:-). The significant change has been the addition of long file names.
These names support up to 255 characters including spaces and lower
case characters as opposed to the traditional 8.3 short names.
Here is the description of the traditional FAT entry in the current
Windows 95 filesystem:
struct directory { // Short 8.3 names
	unsigned char name[8];		// file name
	unsigned char ext[3];		// file extension
	unsigned char attr;		// attribute byte
	unsigned char lcase;		// Case for base and extension
	unsigned char ctime_ms;		// Creation time, milliseconds
	unsigned char ctime[2];		// Creation time
	unsigned char cdate[2];		// Creation date
	unsigned char adate[2];		// Last access date
	unsigned char reserved[2];	// reserved values (ignored)
	unsigned char time[2];		// time stamp
	unsigned char date[2];		// date stamp
	unsigned char start[2];		// starting cluster number
	unsigned char size[4];		// size of the file
};
The lcase field specifies if the base and/or the extension of an 8.3
name should be capitalized. This field does not seem to be used by
Windows 95 but it is used by Windows NT. As a result, the case of
filenames is not completely compatible between Windows NT and Windows
95: filenames that fit in the 8.3 namespace and are written on Windows
NT to be lowercase will show up as uppercase on Windows 95.
Note that the "start" and "size" values are actually little
endian integer values. The descriptions of the fields in this
structure are public knowledge and can be found elsewhere.
With the extended FAT system, Microsoft has inserted extra
directory entries for any files with extended names. (Any name which
legally fits within the old 8.3 encoding scheme does not have extra
entries.) I call these extra entries slots. Basically, a slot is a
specially formatted directory entry which holds up to 13 characters of
a file's extended name. Think of slots as additional labeling for the
directory entry of the file to which they correspond. Microsoft
prefers to refer to the 8.3 entry for a file as its alias and the
extended slot directory entries as the file name.
The C structure for a slot directory entry follows:
struct slot { // Up to 13 characters of a long name
	unsigned char id;		// sequence number for slot
	unsigned char name0_4[10];	// first 5 characters in name
	unsigned char attr;		// attribute byte
	unsigned char reserved;		// always 0
	unsigned char alias_checksum;	// checksum for 8.3 alias
	unsigned char name5_10[12];	// 6 more characters in name
	unsigned char start[2];		// starting cluster number
	unsigned char name11_12[4];	// last 2 characters in name
};
If the layout of the slots looks a little odd, it's only
because of Microsoft's efforts to maintain compatibility with old
software. The slots must be disguised to prevent old software from
panicking. To this end, a number of measures are taken:
1) The attribute byte for a slot directory entry is always set
to 0x0f. This corresponds to an old directory entry with
attributes of "hidden", "system", "read-only", and "volume
label". Most old software will ignore any directory
entries with the "volume label" bit set. Real volume label
entries don't have the other three bits set.
2) The starting cluster is always set to 0, an impossible
value for a DOS file.
Because the extended FAT system is backward compatible, it is
possible for old software to modify directory entries. Measures must
be taken to ensure the validity of slots. An extended FAT system can
verify that a slot does in fact belong to an 8.3 directory entry by
the following:
1) Positioning. Slots for a file always immediately precede
their corresponding 8.3 directory entry. In addition, each
slot has an id which marks its order in the extended file
name. Here is a very abbreviated view of an 8.3 directory
entry and its corresponding long name slots for the file
"My Big File.Extension which is long":
<preceding files...>
<slot #3, id = 0x43, characters = "h is long">
<slot #2, id = 0x02, characters = "xtension whic">
<slot #1, id = 0x01, characters = "My Big File.E">
<directory entry, name = "MYBIGFIL.EXT">
Note that the slots are stored from last to first. Slots
are numbered from 1 to N. The Nth slot is or'ed with 0x40
to mark it as the last one.
2) Checksum. Each slot has an "alias_checksum" value. The
checksum is calculated from the 8.3 name using the
following algorithm (a self-contained C version of this loop
is given after this list):
for (sum = i = 0; i < 11; i++) {
	sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
}
3) If there is free space in the final slot, a Unicode NULL (0x0000)
is stored after the final character. After that, all unused
characters in the final slot are set to Unicode 0xFFFF.
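For reference, the checksum loop in point 2 can be written as a
self-contained C function (an illustrative sketch; the function name
is not taken from any driver source):

unsigned char alias_checksum(const unsigned char name[11])
{
	unsigned char sum = 0;
	int i;

	/* Rotate the running sum right by one bit, then add the next
	 * byte of the 8.3 name (8 name bytes plus 3 extension bytes,
	 * space padded, no dot). */
	for (i = 0; i < 11; i++)
		sum = (((sum & 1) << 7) | ((sum & 0xfe) >> 1)) + name[i];

	return sum;
}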
Finally, note that the extended name is stored in Unicode. Each Unicode
character takes two bytes.

View File

@@ -0,0 +1,671 @@
/* -*- auto-fill -*- */
Overview of the Virtual File System
Richard Gooch <rgooch@atnf.csiro.au>
5-JUL-1999
Conventions used in this document <section>
=================================
Each section in this document will have the string "<section>" at the
right-hand side of the section title. Each subsection will have
"<subsection>" at the right-hand side. These strings are meant to make
it easier to search through the document.
NOTE that the master copy of this document is available online at:
http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
What is it? <section>
===========
The Virtual File System (otherwise known as the Virtual Filesystem
Switch) is the software layer in the kernel that provides the
filesystem interface to userspace programs. It also provides an
abstraction within the kernel which allows different filesystem
implementations to co-exist.
A Quick Look At How It Works <section>
============================
In this section I'll briefly describe how things work, before
launching into the details. I'll start with describing what happens
when user programs open and manipulate files, and then look from the
other view which is how a filesystem is supported and subsequently
mounted.
Opening a File <subsection>
--------------
The VFS implements the open(2), stat(2), chmod(2) and similar system
calls. The pathname argument is used by the VFS to search through the
directory entry cache (dentry cache or "dcache"). This provides a very
fast look-up mechanism to translate a pathname (filename) into a
specific dentry.
An individual dentry usually has a pointer to an inode. Inodes are the
things that live on disc drives, and can be regular files (you know:
those things that you write data into), directories, FIFOs and other
beasts. Dentries live in RAM and are never saved to disc: they exist
only for performance. Inodes live on disc and are copied into memory
when required. Later any changes are written back to disc. The inode
that lives in RAM is a VFS inode, and it is this which the dentry
points to. A single inode can be pointed to by multiple dentries
(think about hardlinks).
The dcache is meant to be a view into your entire filespace. Unlike
Linus, most of us losers can't fit enough dentries into RAM to cover
all of our filespace, so the dcache has bits missing. In order to
resolve your pathname into a dentry, the VFS may have to resort to
creating dentries along the way, and then loading the inode. This is
done by looking up the inode.
To look up an inode (usually read from disc) requires that the VFS
calls the lookup() method of the parent directory inode. This method
is installed by the specific filesystem implementation that the inode
lives in. There will be more on this later.
Once the VFS has the required dentry (and hence the inode), we can do
all those boring things like open(2) the file, or stat(2) it to peek
at the inode data. The stat(2) operation is fairly simple: once the
VFS has the dentry, it peeks at the inode data and passes some of it
back to userspace.
Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file
descriptors). The freshly allocated file structure is initialised with
a pointer to the dentry and a set of file operation member functions.
These are taken from the inode data. The open() file method is then
called so the specific filesystem implementation can do its work. You
can see that this is another switch performed by the VFS.
The file structure is placed into the file descriptor table for the
process.
Reading, writing and closing files (and other assorted VFS operations)
is done by using the userspace file descriptor to grab the appropriate
file structure, and then calling the required file structure method
function to do whatever is required.
For as long as the file is open, it keeps the dentry "open" (in use),
which in turn means that the VFS inode is still in use.
All VFS system calls (i.e. open(2), stat(2), read(2), write(2),
chmod(2) and so on) are called from a process context. You should
assume that these calls are made without any kernel locks being
held. This means that the processes may be executing the same piece of
filesystem or driver code at the same time, on different
processors. You should ensure that access to shared resources is
protected by appropriate locks.
Registering and Mounting a Filesystem <subsection>
-------------------------------------
If you want to support a new kind of filesystem in the kernel, all you
need to do is call register_filesystem(). You pass a structure
describing the filesystem implementation (struct file_system_type)
which is then added to an internal table of supported filesystems. You
can do:
% cat /proc/filesystems
to see what filesystems are currently available on your system.
When a request is made to mount a block device onto a directory in
your filespace the VFS will call the appropriate method for the
specific filesystem. The dentry for the mount point will then be
updated to point to the root inode for the new filesystem.
It's now time to look at things in more detail.
struct file_system_type <section>
=======================
This describes the filesystem. As of kernel 2.1.99, the following
members are defined:
struct file_system_type {
	const char *name;
	int fs_flags;
	struct super_block *(*read_super) (struct super_block *, void *, int);
	struct file_system_type * next;
};
name: the name of the filesystem type, such as "ext2", "iso9660",
"msdos" and so on
fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
read_super: the method to call when a new instance of this
filesystem should be mounted
next: for internal VFS use: you should initialise this to NULL
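Putting the above together, a minimal registration against this
(2.1.99-era) interface might look like the following sketch; "myfs"
and the myfs_* names are hypothetical, and myfs_read_super() is
assumed to be implemented elsewhere:

static struct super_block *myfs_read_super(struct super_block *sb,
					   void *data, int silent);

static struct file_system_type myfs_type = {
	"myfs",			/* name */
	FS_REQUIRES_DEV,	/* fs_flags: we need a block device */
	myfs_read_super,	/* read_super */
	NULL			/* next: for internal VFS use */
};

int init_myfs_fs(void)
{
	/* Adds "myfs" to the table shown by /proc/filesystems. */
	return register_filesystem(&myfs_type);
}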
The read_super() method has the following arguments:
struct super_block *sb: the superblock structure. This is partially
initialised by the VFS and the rest must be initialised by the
read_super() method
void *data: arbitrary mount options, usually comes as an ASCII
string
int silent: whether or not to be silent on error
The read_super() method must determine if the block device specified
in the superblock contains a filesystem of the type the method
supports. On success the method returns the superblock pointer, on
failure it returns NULL.
The most interesting member of the superblock structure that the
read_super() method fills in is the "s_op" field. This is a pointer to
a "struct super_operations" which describes the next level of the
filesystem implementation.
struct super_operations <section>
=======================
This describes how the VFS can manipulate the superblock of your
filesystem. As of kernel 2.1.99, the following members are defined:
struct super_operations {
	void (*read_inode) (struct inode *);
	int (*write_inode) (struct inode *, int);
	void (*put_inode) (struct inode *);
	void (*drop_inode) (struct inode *);
	void (*delete_inode) (struct inode *);
	int (*notify_change) (struct dentry *, struct iattr *);
	void (*put_super) (struct super_block *);
	void (*write_super) (struct super_block *);
	int (*statfs) (struct super_block *, struct statfs *, int);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*clear_inode) (struct inode *);
};
All methods are called without any locks being held, unless otherwise
noted. This means that most methods can block safely. All methods are
only called from a process context (i.e. not from an interrupt handler
or bottom half).
read_inode: this method is called to read a specific inode from the
mounted filesystem. The "i_ino" member in the "struct inode"
will be initialised by the VFS to indicate which inode to
read. Other members are filled in by this method
write_inode: this method is called when the VFS needs to write an
inode to disc. The second parameter indicates whether the write
should be synchronous or not; not all filesystems check this flag.
put_inode: called when the VFS inode is removed from the inode
cache. This method is optional
drop_inode: called when the last access to the inode is dropped,
with the inode_lock spinlock held.
This method should be either NULL (normal unix filesystem
semantics) or "generic_delete_inode" (for filesystems that do not
want to cache inodes - causing "delete_inode" to always be
called regardless of the value of i_nlink)
The "generic_delete_inode()" behaviour is equivalent to the
old practice of using "force_delete" in the put_inode() case,
but does not have the races that the "force_delete()" approach
had.
delete_inode: called when the VFS wants to delete an inode
notify_change: called when VFS inode attributes are changed. If this
is NULL the VFS falls back to the write_inode() method. This
is called with the kernel lock held
put_super: called when the VFS wishes to free the superblock
(i.e. unmount). This is called with the superblock lock held
write_super: called when the VFS superblock needs to be written to
disc. This method is optional
statfs: called when the VFS needs to get filesystem statistics. This
is called with the kernel lock held
remount_fs: called when the filesystem is remounted. This is called
with the kernel lock held
clear_inode: called when the VFS clears the inode. Optional
The read_inode() method is responsible for filling in the "i_op"
field. This is a pointer to a "struct inode_operations" which
describes the methods that can be performed on individual inodes.
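As a rough illustration of that division of labour, a read_inode()
method for an imaginary "myfs" might look like the sketch below (the
myfs_* names are hypothetical and the field values are placeholders;
a real filesystem would read the on-disc inode here):

static struct inode_operations myfs_inode_operations;	/* next section */

static void myfs_read_inode(struct inode *inode)
{
	/* The VFS has already set inode->i_ino; use it to locate and
	 * read the on-disc inode, then copy its fields. Placeholder
	 * values are used in this sketch. */
	inode->i_mode = S_IFREG | 0644;
	inode->i_size = 0;

	/* The important part: install the per-inode methods. */
	inode->i_op = &myfs_inode_operations;
}

The method is then hooked into the superblock via "s_op", e.g.:

static struct super_operations myfs_sops = {
	.read_inode	= myfs_read_inode,
	/* methods left unset are treated as optional/absent */
};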
struct inode_operations <section>
=======================
This describes how the VFS can manipulate an inode in your
filesystem. As of kernel 2.1.99, the following members are defined:
struct inode_operations {
	struct file_operations * default_file_ops;
	int (*create) (struct inode *,struct dentry *,int);
	int (*lookup) (struct inode *,struct dentry *);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct inode *,struct dentry *,int);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
	int (*rename) (struct inode *, struct dentry *,
		       struct inode *, struct dentry *);
	int (*readlink) (struct dentry *, char *,int);
	struct dentry * (*follow_link) (struct dentry *, struct dentry *);
	int (*readpage) (struct file *, struct page *);
	int (*writepage) (struct page *page, struct writeback_control *wbc);
	int (*bmap) (struct inode *,int);
	void (*truncate) (struct inode *);
	int (*permission) (struct inode *, int);
	int (*smap) (struct inode *,int);
	int (*updatepage) (struct file *, struct page *, const char *,
			   unsigned long, unsigned int, int);
	int (*revalidate) (struct dentry *);
};
Again, all methods are called without any locks being held, unless
otherwise noted.
default_file_ops: this is a pointer to a "struct file_operations"
which describes how to open and then manipulate open files
create: called by the open(2) and creat(2) system calls. Only
required if you want to support regular files. The dentry you
get should not have an inode (i.e. it should be a negative
dentry). Here you will probably call d_instantiate() with the
dentry and the newly created inode
lookup: called when the VFS needs to look up an inode in a parent
directory. The name to look for is found in the dentry. This
method must call d_add() to insert the found inode into the
dentry. The "i_count" field in the inode structure should be
incremented. If the named inode does not exist a NULL inode
should be inserted into the dentry (this is called a negative
dentry). Returning an error code from this routine must only
be done on a real error, otherwise creating inodes with system
calls like create(2), mknod(2), mkdir(2) and so on will fail.
If you wish to overload the dentry methods then you should
initialise the "d_op" field in the dentry; this is a pointer
to a struct "dentry_operations".
This method is called with the directory inode semaphore held.
(A sketch of a simple lookup() method is given after this list.)
link: called by the link(2) system call. Only required if you want
to support hard links. You will probably need to call
d_instantiate() just as you would in the create() method
unlink: called by the unlink(2) system call. Only required if you
want to support deleting inodes
symlink: called by the symlink(2) system call. Only required if you
want to support symlinks. You will probably need to call
d_instantiate() just as you would in the create() method
mkdir: called by the mkdir(2) system call. Only required if you want
to support creating subdirectories. You will probably need to
call d_instantiate() just as you would in the create() method
rmdir: called by the rmdir(2) system call. Only required if you want
to support deleting subdirectories
mknod: called by the mknod(2) system call to create a device (char,
block) inode or a named pipe (FIFO) or socket. Only required
if you want to support creating these types of inodes. You
will probably need to call d_instantiate() just as you would
in the create() method
readlink: called by the readlink(2) system call. Only required if
you want to support reading symbolic links
follow_link: called by the VFS to follow a symbolic link to the
inode it points to. Only required if you want to support
symbolic links
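As a sketch of how lookup() and the dcache fit together, a very
simple lookup() method against the prototype above might read as
follows (myfs_find_entry() is an assumed filesystem-private helper,
not a VFS function):

static int myfs_lookup(struct inode *dir, struct dentry *dentry)
{
	struct inode *inode = NULL;
	unsigned long ino;

	/* Search the on-disc directory for the name; 0 means not found. */
	ino = myfs_find_entry(dir, dentry->d_name.name, dentry->d_name.len);
	if (ino) {
		/* iget() brings the inode into memory (calling the
		 * read_inode() superblock method) and raises i_count. */
		inode = iget(dir->i_sb, ino);
		if (!inode)
			return -EACCES;
	}

	/* A NULL inode here makes this a negative dentry. */
	d_add(dentry, inode);
	return 0;
}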
struct file_operations <section>
======================
This describes how the VFS can manipulate an open file. As of kernel
2.1.99, the following members are defined:
struct file_operations {
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *);
	int (*fasync) (struct file *, int);
	int (*check_media_change) (kdev_t dev);
	int (*revalidate) (kdev_t dev);
	int (*lock) (struct file *, int, struct file_lock *);
};
Again, all methods are called without any locks being held, unless
otherwise noted.
llseek: called when the VFS needs to move the file position index
read: called by read(2) and related system calls
write: called by write(2) and related system calls
readdir: called when the VFS needs to read the directory contents
poll: called by the VFS when a process wants to check if there is
activity on this file and (optionally) go to sleep until there
is activity. Called by the select(2) and poll(2) system calls
ioctl: called by the ioctl(2) system call
mmap: called by the mmap(2) system call
open: called by the VFS when an inode should be opened. When the VFS
opens a file, it creates a new "struct file" and initialises
the "f_op" file operations member with the "default_file_ops"
field in the inode structure. It then calls the open method
for the newly allocated file structure. You might think that
the open method really belongs in "struct inode_operations",
and you may be right. I think it's done the way it is because
it makes filesystems simpler to implement. The open() method
is a good place to initialise the "private_data" member in the
file structure if you want to point to a device structure
release: called when the last reference to an open file is closed
fsync: called by the fsync(2) system call
fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
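To make the switch concrete, a filesystem might fill in a
file_operations table somewhat like this sketch (the myfs_* names are
hypothetical; the generic_file_* helpers are assumed to be the stock
VFS-provided implementations):

static int myfs_open(struct inode *inode, struct file *file)
{
	/* As noted above, a good place to set up file->private_data. */
	file->private_data = NULL;
	return 0;
}

static struct file_operations myfs_file_ops = {
	.llseek	= generic_file_llseek,
	.read	= generic_file_read,
	.mmap	= generic_file_mmap,
	.open	= myfs_open,
	/* methods left unset are simply not supported for these files */
};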
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
support routines in the VFS which will locate the required device
driver information. These support routines replace the filesystem file
operations with those for the device driver, and then proceed to call
the new open() method for the file. This is how opening a device file
in the filesystem eventually ends up calling the device driver open()
method. Note the devfs (the Device FileSystem) has a more direct path
from device node to device driver (this is an unofficial kernel
patch).
Directory Entry Cache (dcache) <section>
------------------------------
struct dentry_operations
========================
This describes how a filesystem can overload the standard dentry
operations. Dentries and the dcache are the domain of the VFS and the
individual filesystem implementations. Device drivers have no business
here. These methods may be set to NULL, as they are either optional or
the VFS uses a default. As of kernel 2.1.99, the following members are
defined:
struct dentry_operations {
	int (*d_revalidate)(struct dentry *);
	int (*d_hash) (struct dentry *, struct qstr *);
	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
	void (*d_delete)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
};
d_revalidate: called when the VFS needs to revalidate a dentry. This
is called whenever a name look-up finds a dentry in the
dcache. Most filesystems leave this as NULL, because all their
dentries in the dcache are valid
d_hash: called when the VFS adds a dentry to the hash table
d_compare: called when a dentry should be compared with another
d_delete: called when the last reference to a dentry is
deleted. This means no-one is using the dentry, however it is
still valid and in the dcache
d_release: called when a dentry is really deallocated
d_iput: called when a dentry loses its inode (just prior to its
being deallocated). The default when this is NULL is that the
VFS calls iput(). If you define this method, you must call
iput() yourself
Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
directory.
Directory Entry Cache APIs
--------------------------
There are a number of functions defined which permit a filesystem to
manipulate dentries:
dget: open a new handle for an existing dentry (this just increments
the usage count)
dput: close a handle for a dentry (decrements the usage count). If
the usage count drops to 0, the "d_delete" method is called
and the dentry is placed on the unused list if the dentry is
still in its parent's hash list. Putting the dentry on the
unused list just means that if the system needs some RAM, it
goes through the unused list of dentries and deallocates them.
If the dentry has already been unhashed and the usage count
drops to 0, the dentry is deallocated after the
"d_delete" method is called
d_drop: this unhashes a dentry from its parent's hash list. A
subsequent call to dput() will deallocate the dentry if its
usage count drops to 0
d_delete: delete a dentry. If there are no other open references to
the dentry then the dentry is turned into a negative dentry
(the d_iput() method is called). If there are other
references, then d_drop() is called instead
d_add: add a dentry to its parent's hash list and then call
d_instantiate()
d_instantiate: add a dentry to the alias hash list for the inode and
update the "d_inode" member. The "i_count" member in the
inode structure should be set/incremented. If the inode
pointer is NULL, the dentry is called a "negative
dentry". This function is commonly called when an inode is
created for an existing negative dentry
d_lookup: look up a dentry given its parent and path name component.
It looks up the child of that given name in the dcache
hash table. If it is found, the reference count is incremented
and the dentry is returned. The caller must use dput()
to free the dentry when it finishes using it.
RCU-based dcache locking model
------------------------------
On many workloads, the most common operation on dcache is
to look up a dentry, given a parent dentry and the name
of the child. Typically, for every open(), stat() etc.,
the dentry corresponding to the pathname will be looked
up by walking the tree starting with the first component
of the pathname and using that dentry along with the next
component to look up the next level and so on. Since it
is a frequent operation for workloads like multiuser
environments and webservers, it is important to optimize
this path.
Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
in every component during path look-up. Since 2.5.10, the
fastwalk algorithm changed this by holding the dcache_lock
at the beginning and walking as many cached path component
dentries as possible. This significantly decreases the number
of acquisitions of dcache_lock. However it also increases the
lock hold time significantly and affects performance in large
SMP machines. Since the 2.5.62 kernel, dcache has been using
a new locking model that uses RCU to make dcache look-up
lock-free.
The current dcache locking model is not very different from the
previous one. Prior to the 2.5.62 kernel, dcache_lock protected the
hash chain, the d_child, d_alias and d_lru lists, as well as d_inode
and several other things like mount look-up. The RCU-based changes
affect only the way the hash chain is protected. For everything else
the dcache_lock must be taken for both traversing and updating. Hash
chain updates also take the dcache_lock.
The significant change is the way d_lookup traverses the hash chain:
it doesn't acquire the dcache_lock for this and relies on RCU to
ensure that the dentry has not been *freed*.
Dcache locking details
----------------------
For many multi-user workloads, open() and stat() on files are
very frequently occurring operations. Both involve walking
of path names to find the dentry corresponding to the
concerned file. In 2.4 kernel, dcache_lock was held
during look-up of each path component. Contention and
cacheline bouncing of this global lock caused significant
scalability problems. With the introduction of RCU
in linux kernel, this was worked around by making
the look-up of path components during path walking lock-free.
Safe lock-free look-up of dcache hash table
===========================================
Dcache is a complex data structure with the hash table entries
also linked together in other lists. In 2.4 kernel, dcache_lock
protected all the lists. We applied RCU only on hash chain
walking. The rest of the lists are still protected by dcache_lock.
Some of the important changes are :
1. The deletion from the hash chain is done using the hlist_del_rcu()
macro, which doesn't clear the next pointer of the deleted dentry;
this allows us to walk the chain safely lock-free while a deletion
is happening.
2. Insertion of a dentry into the hash table is done using
hlist_add_head_rcu(), which takes care of ordering the writes -
the writes to the dentry must be visible before the dentry
is inserted. This works in conjunction with hlist_for_each_rcu()
while walking the hash chain. The only requirement is that
all initialization of the dentry must be done before hlist_add_head_rcu(),
since we don't have dcache_lock protection while traversing
the hash chain. This isn't different from the existing code.
3. The dentry looked up without holding dcache_lock cannot be
returned for walking if it is unhashed. It then may have a NULL
d_inode or other bogosity since RCU doesn't protect the other
fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
indicate unhashed dentries and use this in conjunction with a
per-dentry lock (d_lock). Once looked up without the dcache_lock,
we acquire the per-dentry lock (d_lock) and check if the
dentry is unhashed. If so, the look-up fails. If not, the
reference count of the dentry is increased and the dentry is
returned. (A simplified sketch of this is given after this list.)
4. Once a dentry is looked up, it must be ensured during the path
walk for that component it doesn't go away. In pre-2.5.10 code,
this was done holding a reference to the dentry. dcache_rcu does
the same. In some sense, dcache_rcu path walking looks like
the pre-2.5.10 version.
5. All dentry hash chain updations must take the dcache_lock as well as
the per-dentry lock in that order. dput() does this to ensure
that a dentry that has just been looked up in another CPU
doesn't get deleted before dget() can be done on it.
6. There are several ways to do reference counting of RCU protected
objects. One such example is in ipv4 route cache where
deferred freeing (using call_rcu()) is done as soon as
the reference count goes to zero. This cannot be done in
the case of dentries because tearing down of dentries
require blocking (dentry_iput()) which isn't supported from
RCU callbacks. Instead, tearing down of dentries happen
synchronously in dput(), but actual freeing happens later
when RCU grace period is over. This allows safe lock-free
walking of the hash chains, but a matched dentry may have
been partially torn down. The checking of DCACHE_UNHASHED
flag with d_lock held detects such dentries and prevents
them from being returned from look-up.
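The following is a simplified sketch of the lock-free look-up
described in points 2 and 3 above. It is NOT the kernel's actual
__d_lookup() (in particular it ignores the rename_lock handling
discussed in the next section), and the function name is made up:

static struct dentry *rcu_lookup_sketch(struct dentry *parent,
					struct qstr *name,
					struct hlist_head *head)
{
	struct hlist_node *node;
	struct dentry *dentry;

	rcu_read_lock();
	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
		if (dentry->d_name.hash != name->hash)
			continue;
		if (dentry->d_parent != parent)
			continue;

		spin_lock(&dentry->d_lock);
		/* An unhashed dentry may be partially torn down. */
		if (dentry->d_flags & DCACHE_UNHASHED)
			goto next;
		/* Compare the name under d_lock to keep renames away. */
		if (dentry->d_name.len != name->len ||
		    memcmp(dentry->d_name.name, name->name, name->len))
			goto next;

		atomic_inc(&dentry->d_count);
		spin_unlock(&dentry->d_lock);
		rcu_read_unlock();
		return dentry;
next:
		spin_unlock(&dentry->d_lock);
	}
	rcu_read_unlock();
	return NULL;
}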
Maintaining POSIX rename semantics
==================================
Since look-up of dentries is lock-free, it can race against
a concurrent rename operation. For example, during rename
of file A to B, look-up of either A or B must succeed.
So, if look-up of B happens after A has been removed from the
hash chain but not added to the new hash chain, it may fail.
Also, a comparison while the name is being written concurrently
by a rename may result in false positive matches violating
rename semantics. Issues related to race with rename are
handled as described below :
1. Look-up can be done in two ways - d_lookup() which is safe
from simultaneous renames and __d_lookup() which is not.
If __d_lookup() fails, it must be followed up by a d_lookup()
to correctly determine whether a dentry is in the hash table
or not. d_lookup() protects look-ups using a sequence
lock (rename_lock).
2. The name associated with a dentry (d_name) may be changed if
a rename is allowed to happen simultaneously. To avoid the memcmp()
in __d_lookup() going out of bounds due to a rename, and to avoid
false positive comparisons, the name comparison is done while holding
the per-dentry lock. This prevents concurrent renames during this
operation.
3. Hash table walking during look-up may move to a different bucket as
the current dentry is moved to a different bucket due to rename.
But we use hlists in dcache hash table and they are null-terminated.
So, even if a dentry moves to a different bucket, hash chain
walk will terminate. [with a list_head list, it may not since
termination is when the list_head in the original bucket is reached].
Since we redo the d_parent check and compare name while holding
d_lock, lock-free look-up will not race against d_move().
4. There can be a theoretical race when a dentry keeps coming back
to its original bucket due to double moves. Due to this, the look-up
may consider that it has never moved and can end up in an infinite
loop. But this is not any worse than the theoretical livelocks we
already have in the kernel.
Important guidelines for filesystem developers related to dcache_rcu
====================================================================
1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
don't change. Only the dcache internal implementation changes. However,
filesystems *must not* delete from the dentry hash chains directly
using the list macros as was allowed earlier. They must use the dcache
APIs d_drop() or __d_drop(), depending on the situation.
2. d_flags is now protected by a per-dentry lock (d_lock). All
access to d_flags must be protected by it.
3. For a hashed dentry, checking of d_count needs to be protected
by d_lock.
Papers and other documentation on dcache locking
================================================
1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
2. http://lse.sourceforge.net/locking/dcache/dcache.html

View File

@@ -0,0 +1,188 @@
The SGI XFS Filesystem
======================
XFS is a high performance journaling filesystem which originated
on the SGI IRIX platform. It is completely multi-threaded, can
support large files and large filesystems, extended attributes,
variable block sizes, is extent based, and makes extensive use of
Btrees (directories, extents, free space) to aid both performance
and scalability.
Refer to the documentation at http://oss.sgi.com/projects/xfs/
for further details. This implementation is on-disk compatible
with the IRIX version of XFS.
Mount Options
=============
When mounting an XFS filesystem, the following options are accepted.
biosize=size
Sets the preferred buffered I/O size (default size is 64K).
"size" must be expressed as the logarithm (base2) of the
desired I/O size.
Valid values for this option are 14 through 16, inclusive
(i.e. 16K, 32K, and 64K bytes). On machines with a 4K
pagesize, 13 (8K bytes) is also a valid size.
The preferred buffered I/O size can also be altered on an
individual file basis using the ioctl(2) system call.
ikeep/noikeep
When inode clusters are emptied of inodes, keep them around
on the disk (ikeep) - this is the traditional XFS behaviour
and is still the default for now. Using the noikeep option,
inode clusters are returned to the free space pool.
logbufs=value
Set the number of in-memory log buffers. Valid numbers range
from 2-8 inclusive.
The default value is 8 buffers for filesystems with a
blocksize of 64K, 4 buffers for filesystems with a blocksize
of 32K, 3 buffers for filesystems with a blocksize of 16K
and 2 buffers for all other configurations. Increasing the
number of buffers may increase performance on some workloads
at the cost of the memory used for the additional log buffers
and their associated control structures.
logbsize=value
Set the size of each in-memory log buffer.
Size may be specified in bytes, or in kilobytes with a "k" suffix.
Valid sizes for version 1 and version 2 logs are 16384 (16k) and
32768 (32k). Valid sizes for version 2 logs also include
65536 (64k), 131072 (128k) and 262144 (256k).
The default value for machines with more than 32MB of memory
is 32768, machines with less memory use 16384 by default.
logdev=device and rtdev=device
Use an external log (metadata journal) and/or real-time device.
An XFS filesystem has up to three parts: a data section, a log
section, and a real-time section. The real-time section is
optional, and the log section can be separate from the data
section or contained within it.
noalign
Data allocations will not be aligned at stripe unit boundaries.
noatime
Access timestamps are not updated when a file is read.
norecovery
The filesystem will be mounted without running log recovery.
If the filesystem was not cleanly unmounted, it is likely to
be inconsistent when mounted in "norecovery" mode.
Some files or directories may not be accessible because of this.
Filesystems mounted "norecovery" must be mounted read-only or
the mount will fail.
nouuid
Don't check for double mounted file systems using the file system uuid.
This is useful to mount LVM snapshot volumes.
osyncisosync
Make O_SYNC writes implement true O_SYNC. WITHOUT this option,
Linux XFS behaves as if an "osyncisdsync" option is used,
which will make writes to files opened with the O_SYNC flag set
behave as if the O_DSYNC flag had been used instead.
This can result in better performance without compromising
data safety.
However if this option is not in effect, timestamp updates from
O_SYNC writes can be lost if the system crashes.
If timestamp updates are critical, use the osyncisosync option.
quota/usrquota/uqnoenforce
User disk quota accounting enabled, and limits (optionally)
enforced.
grpquota/gqnoenforce
Group disk quota accounting enabled and limits (optionally)
enforced.
sunit=value and swidth=value
Used to specify the stripe unit and width for a RAID device or
a stripe volume. "value" must be specified in 512-byte block
units.
If this option is not specified and the filesystem was made on
a stripe volume or the stripe width or unit were specified for
the RAID device at mkfs time, then the mount system call will
restore the value from the superblock. For filesystems that
are made directly on RAID devices, these options can be used
to override the information in the superblock if the underlying
disk layout changes after the filesystem has been created.
The "swidth" option is required if the "sunit" option has been
specified, and must be a multiple of the "sunit" value.
sysctls
=======
The following sysctls are available for the XFS filesystem:
fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
Setting this to "1" clears accumulated XFS statistics
in /proc/fs/xfs/stat. It then immediately resets to "0".
fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
The interval at which the xfssyncd thread flushes metadata
out to disk. This thread will flush log activity out, and
do some processing on unlinked inodes.
fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000)
The interval at which xfsbufd scans the dirty metadata buffers list.
fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000)
The age at which xfsbufd flushes dirty metadata buffers to disk.
fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
A volume knob for error reporting when internal errors occur.
This will generate detailed messages & backtraces for filesystem
shutdowns, for example. Current threshold values are:
XFS_ERRLEVEL_OFF: 0
XFS_ERRLEVEL_LOW: 1
XFS_ERRLEVEL_HIGH: 5
fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
Causes certain error conditions to call BUG(). Value is a bitmask;
AND together the tags which represent errors which should cause panics:
XFS_NO_PTAG 0
XFS_PTAG_IFLUSH 0x00000001
XFS_PTAG_LOGRES 0x00000002
XFS_PTAG_AILDELETE 0x00000004
XFS_PTAG_ERROR_REPORT 0x00000008
XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010
XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
This option is intended for debugging only.
fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
Controls whether symlinks are created with mode 0777 (default)
or whether their mode is affected by the umask (irix mode).
fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
Controls files created in SGID directories.
If the group ID of the new file does not match the effective group
ID or one of the supplementary group IDs of the parent dir, the
ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
is set.
fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1)
Controls whether unprivileged users can use chown to "give away"
a file to another user.
fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
Setting this to "1" will cause the "sync" flag set
by the chattr(1) command on a directory to be
inherited by files in that directory.
fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
Setting this to "1" will cause the "nodump" flag set
by the chattr(1) command on a directory to be
inherited by files in that directory.
fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
Setting this to "1" will cause the "noatime" flag set
by the chattr(1) command on a directory to be
inherited by files in that directory.