Monday, February 25, 2008

Upgrading Unix OS Mlink problem

If you have upgraded your Unix operating system, or are contemplating an upgrade, you should be aware of a problem that can arise afterward.

When Mlink transfers a file, by default it uses a "temporary receive file" until the transfer is completed. It then renames the temporary file to its permanent name. To prevent temporary file name conflicts, Mlink uses a unique numeric naming convention. The name is derived from the leftmost 5 characters of the Unix process ID of the process transferring the file. This scheme worked fine as long as process IDs were no more than 5 characters long. 64-bit Unix operating systems can have substantially longer process IDs, which means that two Mlink processes can end up trying to use the same temporary file name during a transfer. This doesn't happen very often, but when it does, the contents of two files from separate remote locations end up in the same file on the receiving system. The second transfer to complete reports a "Local File I/O error. #FILE = 1" error and fails.
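The collision is easy to reproduce on paper. The following is only an illustrative Python sketch, not Mlink's code: the function name temp_name is hypothetical, and the only rule taken from the description above is "keep the leftmost 5 characters of the process ID".

```python
def temp_name(pid: int, width: int = 5) -> str:
    """Hypothetical stand-in: keep only the leftmost `width` characters of the PID."""
    return str(pid)[:width]


# Two distinct 7-digit process IDs that share their first five digits.
pid_a, pid_b = 1234567, 1234599

print(temp_name(pid_a))                      # 12345
print(temp_name(pid_b))                      # 12345
print(temp_name(pid_a) == temp_name(pid_b))  # True: both would use the same temp file
```

With process IDs of 5 digits or fewer, the truncation changes nothing, so every process gets its own name.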

Note that we've seen this problem occur even though the upgraded Unix kernel was built as 32-bit.

Send us an email at drmlink at mlink dot com and we'll tell you how to get the patch that fixes this problem.

5 comments:

RLR said...

Thanks for the blog, although I'm not real keen on Google authentication.

In a related vein, is there any way to control ACM's timeout on RD commands?

Rick Cone said...

Mlink 6.5.1 mass uses #UNIQUEID to combat OS with 64 bit process ID's (like AIX)...

Jerry Nicholson said...

There are two methods for changing the timeout value in ACM. You can add a local system command to a task list using "set_timeout nn" as the command. "nn" is the desired timeout (in tenths of a second). Add this task just prior to the remote command. To reset to the normal timeout in that task list, follow the remote command with another local system command, "reset_timeout".

The second method involves editing the "main.run" file in the database directory. The first record in that file defines how the port processes will behave. You can append "-Lnnn" for login timeouts, "-Snnnn" for remote script execution timeouts and "-Tnnnn" for file transfer and remote command timeouts. The amproc line in the file might then look like the following:

0|P|amproc -T3000

3000 tenths would provide a 5 minute timeout.
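As a quick check of the arithmetic, here is a small Python sketch. The helper names are hypothetical and it assumes only the unit stated above (tenths of a second).

```python
def tenths_to_seconds(tenths: int) -> float:
    """Convert a timeout given in tenths of a second to seconds."""
    return tenths / 10


def minutes_to_tenths(minutes: float) -> int:
    """Convert minutes to the tenths-of-a-second value used for the timeout."""
    return round(minutes * 60 * 10)


print(tenths_to_seconds(3000))  # 300.0 seconds, i.e. 5 minutes
print(minutes_to_tenths(5))     # 3000
```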

Using this second method means you are stuck with whatever values are in the main.run file. If you use this method, note that you'll have to stop and then activate the ports for the new values to take effect.

Make sure you don't mess up the main.run file or the ACM database won't start! Always make a backup copy of main.run so you can back out any bogus changes.

Rick is correct - the patch utilizes a new UNIQUEID variable to store a 12-byte string value. The CA reference number for this fix is APAR # QO91043.

RLR said...

Very cool. Hadn't seen any references to these in the manuals.

Not sure how precise they are, that is to say, what interval they're measuring. Set_timeout 600 didn't cause a timeout for an SD that took over 2 minutes; set_timeout 2 let the same SD run for 20 seconds before timing it out. But it helps, especially since it returns command execution to the .TKQ.

For commands that could take a while to complete I've been backgrounding them and doing a local sleep. But that didn't help for RD and SD which I knew used some local timer.

Jerry Nicholson said...

Changing the timeout value will have no effect on a file transfer as long as the transfer is progressing normally. The reason for this is that the timeout value is an IDLE state timer. That is, if no data whatsoever is received in the timeout period, the timer will go off. If the line is not idle, the timer will not go off.

During a file transfer, there is a fair amount of chatter going back and forth as one side is sending data while the other side returns an ACK (acknowledgement) for every frame of data it receives. There is very little idle time so the timer doesn't take effect.

During a remote command, the sending side formats the command into an Mlink protocol block and sends it to the remote. The remote executes the command, during which time it isn't ACKing anything and the sending side is sitting there, idle, listening for the ACK. This is where the (idle state) timeout value comes into play.
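The difference between the two cases can be sketched in Python. This is a simplified model of an idle-state timer, not Mlink's implementation: the function name and the sample timings are assumptions chosen only for illustration.

```python
def idle_timer_fires(arrival_times, timeout, start=0.0):
    """Return True if any gap with no data received exceeds `timeout` seconds.

    `arrival_times` are the moments (in seconds) at which something,
    such as an ACK, is received from the other side.
    """
    last = start
    for t in arrival_times:
        if t - last > timeout:
            return True  # the line sat idle longer than the timeout
        last = t         # any received data restarts the idle period
    return False


# File transfer: an ACK every half second for two minutes.
transfer_acks = [i * 0.5 for i in range(1, 241)]
# Remote command: nothing at all until the single ACK after two minutes.
command_acks = [120.0]

print(idle_timer_fires(transfer_acks, timeout=60))  # False
print(idle_timer_fires(command_acks, timeout=60))   # True
print(idle_timer_fires(command_acks, timeout=150))  # False
```

Both activities last two minutes, but only the remote command leaves the line idle long enough for a 60 second timer to go off.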

In fact, this situation causes a common error which shows up in the ACM log as "Host Mode Failed". The sending side sends the command in a protocol block and then listens. If the command takes a long time, the sender times out and, since it's following the protocol rules, it retries the last protocol block (the remote command). It will resend this command for the retry count if it continues to time out.

Meanwhile, the remote finishes the first command and finds the resent command waiting for it so it executes it again... and again, and again, etc. Eventually, the sending side uses up all of its retries and fails that command. Or, it might get the ACK from the first retry and think it's the ACK from the last retry it sent. When it moves on to the next task it tries to set up the remote again, but the remote is busy executing all of those retries so the sending side decides the remote is munged and terminates with a host mode failure.

set_timeout will prevent this situation from occurring as long as the timeout value is longer than the remote command takes to execute.
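To put rough numbers on the retry pile-up, here is a simplified Python model. It is only a sketch: the function name, the timings and the retry count of 5 are hypothetical, and it assumes the sender resends each time the idle timer goes off until the first ACK arrives or the retries run out.

```python
def sends_before_first_ack(command_secs, timeout_secs, max_retries):
    """Count the copies of the command sent before the first ACK can arrive.

    The result includes the original send plus any retries.
    """
    sends = 1
    elapsed = timeout_secs
    while elapsed < command_secs and sends <= max_retries:
        sends += 1               # idle timer went off: resend the same block
        elapsed += timeout_secs
    return sends


# A 2 minute command with a 30 second timeout: the remote receives 4 copies.
print(sends_before_first_ack(120, 30, max_retries=5))   # 4
# Same command with the timeout raised above the run time: sent once.
print(sends_before_first_ack(120, 150, max_retries=5))  # 1
```

Every copy beyond the first is one the remote will execute again, which is the pile-up described above.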