Linux/ClearCase file deletions?
Something very strange happened recently: a number of directories inside an automounted NFS directory were being deleted every couple of minutes. The cause may be a ClearCase bug, specific to Red Hat Enterprise Linux 4.x, that can randomly delete files. Gulp!
This is supposed to have been fixed in ClearCase patch 42, but we were already at patch 49.
The problem is described in this bug report and also in this tech note.
To troubleshoot this, I ran a loop on the command line that told me when the directory disappeared. At the same time, I ran snoop on the machine that hosted the directory, which pointed me to another machine. Snooping on that suspect machine showed network calls such as rm and rmdir. The /var/log/messages file showed entries similar to this:
automount[11495]: rm_unwanted: unable to remove link: /path/truncated, error: Permission denied
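My original loop was a one-off on the command line and isn't reproduced here. A rough Python equivalent, sketched under the assumption that you simply poll a directory inside the NFS mount until it vanishes, looks like this; the function name and its parameters are hypothetical:

```python
import os
import time
from datetime import datetime


def watch_directory(path, interval=5.0, max_checks=None):
    """Poll until `path` is no longer a directory.

    Returns the datetime at which the directory was first seen missing, or
    None if it was still present after `max_checks` polls (None = forever).
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not os.path.isdir(path):
            gone_at = datetime.now()
            # Timestamp the disappearance so it can be matched against snoop output.
            print(f"{gone_at:%Y-%m-%d %H:%M:%S} {path} disappeared")
            return gone_at
        checks += 1
        time.sleep(interval)
    return None
```

Running `watch_directory("/some/nfs/dir")` (path hypothetical) in one terminal while snoop runs in another gives you a timestamp to line the two up. Keep in mind that a lookup under an autofs mount point can itself trigger a mount.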
I shut down ClearCase and the problem remained. After unmounting and remounting the directory, the problem went away. I am including the tech note in full:
Problem
This technote identifies an IBM® Rational® ClearCase® defect relating to the MVFS on RedHat® Enterprise 4 Linux® (RHEL): on systems with patches earlier than clearcase_p2003.06.00-42 or 7.0.0.1-RATLC-RCC, files could be deleted under very specific circumstances while VOBs are mounted on the RHEL 4 file system. It also provides information on how to work around the issue.
Cause
A Linux radix tree is a mechanism by which a (pointer) value can be associated with a (long) integer key. It is reasonably efficient in terms of storage, and is quite quick on lookups. Additionally, radix trees in the Linux kernel have some features driven by kernel-specific needs, including the ability to associate tags with specific entries. A radix tree contains leaf nodes, which contain slots, each of which can contain a pointer to something of interest to the creator of the tree. Empty slots contain a NULL pointer.
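To make the description above concrete, here is a minimal Python sketch of a radix tree. It is not the kernel's implementation and not part of IBM's technote: it assumes a fixed height of three levels and 6 key bits per level (64 slots per node), the class and method names are hypothetical, and tags are simplified to a per-leaf set of slot indices instead of tag bitmaps propagated up the tree.

```python
RADIX_SHIFT = 6                  # key bits consumed per level (assumed)
RADIX_SLOTS = 1 << RADIX_SHIFT   # 64 slots per node
RADIX_MASK = RADIX_SLOTS - 1


class RadixNode:
    def __init__(self):
        self.slots = [None] * RADIX_SLOTS  # None stands in for a NULL (empty) slot
        self.tags = set()                  # slot indices that carry a tag


class RadixTree:
    """Fixed-height radix tree mapping non-negative integer keys to values."""

    def __init__(self, height=3):
        self.height = height
        self.root = RadixNode()

    def _path(self, key):
        # Split the key into one slot index per level, most significant bits first.
        if not 0 <= key < (1 << (RADIX_SHIFT * self.height)):
            raise KeyError(key)
        return [(key >> (RADIX_SHIFT * level)) & RADIX_MASK
                for level in reversed(range(self.height))]

    def _leaf(self, key, create=False):
        """Walk down to the leaf node for `key`; return (node or None, slot)."""
        *inner, slot = self._path(key)
        node = self.root
        for idx in inner:
            if node.slots[idx] is None:
                if not create:
                    return None, slot
                node.slots[idx] = RadixNode()
            node = node.slots[idx]
        return node, slot

    def insert(self, key, value):
        node, slot = self._leaf(key, create=True)
        node.slots[slot] = value

    def lookup(self, key):
        node, slot = self._leaf(key)
        return None if node is None else node.slots[slot]

    def delete(self, key):
        """Empty the slot for `key` and return the value that was there."""
        node, slot = self._leaf(key)
        if node is None:
            return None
        old = node.slots[slot]
        node.slots[slot] = None
        node.tags.discard(slot)
        return old

    def tag_set(self, key):
        node, slot = self._leaf(key)
        if node is None or node.slots[slot] is None:
            raise KeyError(key)
        node.tags.add(slot)

    def tag_get(self, key):
        node, slot = self._leaf(key)
        return node is not None and slot in node.tags
```

Lookups cost one step per level regardless of how many keys are stored, and intermediate nodes are only allocated for key ranges that are actually used, which is where the storage efficiency comes from.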
The radix tree in question is the one that records allocated and free minor device numbers for the anonymous devices [unnamed_dev_idr]: these are used for NFS mounts. During kernel debugger investigations, the free/in-use bits in the radix tree did not match the device numbers of the mounted file systems: the device numbers were still in use while the tree reported that they were free.
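The following Python sketch (not IBM's code) models what that mismatch means in practice. `MinorAllocator` is a hypothetical stand-in for unnamed_dev_idr that hands out the lowest free minor number; the corruption is simulated with a spurious `put()`, and the mount names are invented. Anonymous devices use major number 0.

```python
from collections import defaultdict

UNNAMED_MAJOR = 0  # anonymous (unnamed) devices use major number 0


class MinorAllocator:
    """Hypothetical stand-in for unnamed_dev_idr: hands out the lowest free minor."""

    def __init__(self, limit=256):
        self.limit = limit
        self.in_use = set()  # minors the allocator *believes* are taken

    def get(self):
        for minor in range(self.limit):
            if minor not in self.in_use:
                self.in_use.add(minor)
                return minor
        raise RuntimeError("no free minor numbers")

    def put(self, minor):
        self.in_use.discard(minor)


def shared_devices(mounts):
    """Return {s_dev: [mount, ...]} for device numbers held by more than one mount."""
    by_dev = defaultdict(list)
    for name, s_dev in mounts.items():
        by_dev[s_dev].append(name)
    return {dev: names for dev, names in by_dev.items() if len(names) > 1}


alloc = MinorAllocator()
mounts = {}
mounts["autofs /home"] = (UNNAMED_MAJOR, alloc.get())
# Simulated corruption: the allocator "forgets" that minor 0 is still in use,
# like the free/in-use bits that disagreed with the mounted file systems.
alloc.put(mounts["autofs /home"][1])
mounts["nfs server:/export"] = (UNNAMED_MAJOR, alloc.get())

print(shared_devices(mounts))
# {(0, 0): ['autofs /home', 'nfs server:/export']}
```

Once the allocator's view and reality diverge, the next caller is simply handed a number that is still live, which is exactly the duplicate device number described next.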
As a consequence, an NFS mount can obtain a device number that is already in use by another mount. AutoFS uses these device numbers as part of its strategy to make sure it does not delete anything it did not create itself when the cleanup process is triggered.
To trigger the condition described, two superblocks would have to be allocated the same device number (s_dev). In the cases we have seen, the two superblocks involved were an AutoFS mount and an NFS mount. When the AutoFS mount timed out, AutoFS started to clean up the files it had previously created, resulting in the deletion of files on the NFS mount.
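The technote doesn't show AutoFS's cleanup code. The dry-run Python sketch below models only the idea it describes: an entry counts as "ours" when its device number matches the AutoFS mount's s_dev. The `Entry` type, `plan_cleanup` function, and all paths are hypothetical, and nothing is actually deleted.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    path: str
    st_dev: Tuple[int, int]  # (major, minor) of the file system holding the entry


def plan_cleanup(entries, autofs_dev):
    """Dry run: paths an AutoFS-style cleanup would remove.

    An entry counts as 'ours' only when its device number equals the
    AutoFS mount's s_dev; everything else is assumed to belong to another
    file system and is skipped.
    """
    doomed = [e.path for e in entries if e.st_dev == autofs_dev]
    # Deepest paths first, so files go before the directories holding them.
    return sorted(doomed, key=lambda p: p.count("/"), reverse=True)


AUTOFS_DEV = (0, 0)
entries = [
    Entry("/auto/proj", (0, 0)),             # directory AutoFS created itself
    Entry("/auto/proj/src", (0, 0)),         # lives on an NFS mount handed the same s_dev
    Entry("/auto/proj/src/main.c", (0, 0)),
    Entry("/auto/other/README", (0, 1)),     # NFS mount with its own s_dev
]
print(plan_cleanup(entries, AUTOFS_DEV))
# ['/auto/proj/src/main.c', '/auto/proj/src', '/auto/proj']
```

The device-number check is sound only as long as device numbers are unique; with a duplicate s_dev, the NFS files pass it and are scheduled for removal along with AutoFS's own directory.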
If the unnamed_dev_idr radix tree becomes corrupt, the cleanup of the mount tree can remove file systems that it was not intended to remove. The function that does the cleanup is radix_tree_delete.
Defect APAR PK32893 has been submitted to address this behavior.
Solution
This specific problem only exists on Red Hat Enterprise Linux 4 systems with 2.6.9 kernels.
To avoid this defect, apply the User Space and MVFS patches for ClearCase 2003.06.00 and 7.0 servers (and clients) running Red Hat Enterprise Linux 4 systems with 2.6.9 kernels if the MVFS is installed.
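A quick way to screen hosts for the conditions named above is sketched below in Python; it is not an IBM tool. It assumes the release text in /etc/redhat-release reads like "Red Hat Enterprise Linux AS release 4 (...)" and that the MVFS kernel module shows up in /proc/modules under the name `mvfs`; both are assumptions, and the function names are hypothetical.

```python
import re


def kernel_is_2_6_9(release):
    """True for kernel release strings such as '2.6.9-42.ELsmp'."""
    return release == "2.6.9" or release.startswith("2.6.9-")


def is_rhel4(redhat_release):
    """True if the /etc/redhat-release text names Red Hat Enterprise Linux 4."""
    return bool(re.match(r"Red Hat Enterprise Linux .*release 4\b",
                         redhat_release.strip()))


def mvfs_loaded(proc_modules):
    """True if /proc/modules text lists a module named 'mvfs' (module name assumed)."""
    return any(line.split()[0] == "mvfs"
               for line in proc_modules.splitlines() if line.strip())


def possibly_affected(release, redhat_release, proc_modules):
    """Rough screen for the technote's conditions; the patch level is NOT checked."""
    return (kernel_is_2_6_9(release)
            and is_rhel4(redhat_release)
            and mvfs_loaded(proc_modules))
```

On a live host the inputs would be `platform.release()` and the contents of /etc/redhat-release and /proc/modules. A `True` result only says the host matches the affected platform; whether the fix is installed still has to be checked against the patch list.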
Refer to APAR PK32893 to download the patches.
Note: If your VOBs are replicated, the MultiSite patch multisite_p2003.06.00-16 (or the latest available patch) is also required to stay in sync with the User Space and MVFS updates.