I'd like to do file locking over NFS without using lockd. The reason
I want to avoid using lockd is because many lockd implementations are
too buggy.
It is fairly easy to avoid using lockd -- just avoid using lockf() to
lock a file. Instead of using lockf(), lock a file by creating a lock
file that you open with the O_CREAT | O_EXCL flags. To unlock the
file, you merely unlink the lock file. This method is fairly
reliable, except that there is a small chance with every file lock or
unlock that something will go wrong.
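Something like this, in Python (a minimal sketch; the ".lock" suffix is just
my own convention for illustration):

    import os

    def lock(path):
        # O_CREAT | O_EXCL makes the create atomic: it fails with EEXIST
        # if another process already holds the lock.
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        os.close(fd)

    def unlock(path):
        # Unlocking is just removing the lock file.
        os.unlink(path + ".lock")
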
(1) The lock file might be created without the client realizing that
it has been created if the file creation acknowledgement is lost due
to severe network problems. The file being locked would then remain
locked forever (until someone manually deletes the lock) because no
process would take responsibility for having locked the file. This
failure symptom is relatively benign for my purposes, and if needed it
can be fixed via the approach described in the Red Hat man page for
the open() system call.
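(As I understand it, that man-page approach is roughly the following -- a
rough Python sketch, with the hostname/pid naming being my own choice:

    import os, socket

    def lock(path):
        # Create a uniquely named file, then link() it to the lock file.
        # Even if the reply to link() is lost and the request is retried,
        # the link count on the unique file tells us whether we got the lock.
        unique = "%s.%s.%d" % (path, socket.gethostname(), os.getpid())
        fd = os.open(unique, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        os.close(fd)
        try:
            os.link(unique, path + ".lock")
            got_it = True
        except OSError:
            got_it = os.stat(unique).st_nlink == 2
        os.unlink(unique)   # the unique name is no longer needed either way
        return got_it
)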
(2) When a process goes to remove its file lock, the acknowledgement
for the unlink() could be lost. If that happens, the NFS client will
retry the unlink() request, and if another process has created its own
lock file with the same name in the meantime, the retry will silently
remove that other process's lock. This failure symptom is pretty bad
for my purposes, since it could cause a structured file to become
corrupt.
I have an idea for a slightly different way of doing file locking that
I think solves problem #2 (and also solves problem #1). What if,
instead of using a lock file to lock a file, we rename the file to
something like "filename.locked.hostname.pid"? If the rename()
acknowledgement gets lost, the client will see the rename() system
call as having failed due to the file not existing. But in this case
it can then check for the existence of "filename.locked.hostname.pid".
If this file exists, then the process knows that the rename() system
call didn't actually fail--the acknowledgement just got lost. Later,
when the process goes to unlock the file, it will rename the file back
to "filename". Again, if the rename system call appears to fail, the
process can check for the existence of "filename.locked.hostname.pid".
If the file no longer exists, then it knows the rename call really did
succeed, and again the acknowledgement just got lost.
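In code, I'm picturing something like this (a rough Python sketch; error
handling is simplified, and only the lost-reply case is checked for):

    import os, socket

    def lock(path):
        locked = "%s.locked.%s.%d" % (path, socket.gethostname(), os.getpid())
        try:
            os.rename(path, locked)
        except OSError:
            # The reply may have been lost and the retried rename() failed;
            # if our locked name exists, the rename actually happened and
            # we do hold the lock.
            if not os.path.exists(locked):
                raise
        return locked

    def unlock(path, locked):
        try:
            os.rename(locked, path)
        except OSError:
            # Same check in reverse: if the locked name is gone, the rename
            # back to "filename" really did succeed.
            if os.path.exists(locked):
                raise
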
How does this sound? Is this close to foolproof, or am I missing
something?
I'm not much of an NFS expert, so I am a bit worried that there are
details of NFS client-side caching that I don't understand that would
prevent this scheme from working without some modification.
|>oug
It depends on how other processes are playing with your files; what if another
program is just scanning directories and opening files at random? If this
program doesn't know your locking scheme, it'll just walk all over your
process...
However, I have used a similar method to this (on windows) when I was playing
with a distributed file migration tool I wrote. I had a control directory
which contained a queue of command-files waiting to be processed, and each
participating process had a directory named <hostname>.<pid>, and a log
directory. When a process 'claimed' a command-file, it would try to move it
to its own subdirectory; if that failed, it would just assume that another
process got there first and then pick another one. Once each process had
completed the specified work, it would delete the command-file and write a
log file, named <hostname>.<pid>.<command-file-name> to the log directory
specifying how everything had gone. This worked fine with about 6 'helpers',
each running two or three processes. The 'engine' ran fine - with no race
conditions evident and no file corruptions or collisions. It was a bit of a
bar-steward to *STOP* (just as any multi-process program), as busy 'remote
threads' would have to complete their work before they would see the 'all
stop' flag. The most difficult thing, however, was recovering from a fatal
error (such as when I hit Python's recursion limit), where the supervisor
process died but the workers continued trying to process, or when I'd made
some command error or such and the workers all 'sulked'. It was fun doing it
though, in the end, the performance boost wasn't enough to warrant all the
hassle, so I reverted back to the linear version :-)
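Roughly, the 'claim' step for each worker looked something like this (a
simplified Python sketch from memory, not the actual code):

    import os, socket

    my_dir = "%s.%d" % (socket.gethostname(), os.getpid())

    def claim(control_dir, command_file):
        # Try to move the command-file into our own directory; if the move
        # fails, assume another worker got there first and pick another.
        src = os.path.join(control_dir, command_file)
        dst = os.path.join(control_dir, my_dir, command_file)
        try:
            os.rename(src, dst)
            return dst
        except OSError:
            return None
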
Anyway, the point I'm making is that, if you have complete control over the
location of the files you are trying to lock, and know that no rogue
processes will interfere, you should be ok. Remember that other programs
will not honour your locks unless you modify them to do so (with windows that
may be difficult...). I would suggest moving the file to another directory,
though, rather than simply renaming it, as it's easier to see what's going
on. It's also a good idea to disambiguate your pids by adding the hostname
to them - pids are only guaranteed to be unique within a host, so two remote
processes on different machines could possibly (read: will eventually) end up
with the same pid.
Windows file locking is just as unpredictable as NFS, particularly when you
are mixing different versions on the same network.
Using this renaming scheme does give you one particular edge - portability, as
this should work equally well on windoze, mac and [u|li]n[i|u]ix.
Good luck,
-andyj