At a current customer of ours running a WebCenter Spaces based Intranet with WebCenter Content (UCM) as the backend we started running into some strange performance issues. The customer is running on Patch Set 4 (11.1.1.5.0). The site would slow way down at specific times during the day and become unusable for 5 minute periods. It would then clear itself up and be usable again.
During this time the user load on the site would not change so something strange had to be going on. To diagnose the problem we turned up the system audit tracing in UCM to Full Verbose and looked at the following sections: systemdatabase, searchquery, requestaudit, file*.
The main symptom that were were first seeing was that during these times where the server became unresponsive there would be upwards of 40-50 active database connections for UCM, while the average is usually less than 5. We also saw a lot of long running queries on the System Audit page that the DBAs told us were happening in milliseconds on the database side.
Looking at the tracing output we were able to confirm that the database queries were only taking milliseconds or even fractions of milliseconds. So what was keeping the connections open?
A closer look at the output yielded the following for each and every request:
[sourcecode language=”text”]
>filelock/7 04.09 09:48:06.221 IdcServer-140003 Reserving /proj/wcsshare/ucm/cs/data/alerts/
>filelock/6 04.09 09:48:06.227 IdcServer-140003 /proj/wcsshare/ucm/cs/data/alerts/–<no-agent>–Locked directory
>filelock/6 04.09 09:48:06.227 IdcServer-140003 /proj/wcsshare/ucm/cs/data/alerts/–<no-agent>–Released
[/sourcecode]
So there was a process that was locking a directory for writing on every request. Continuing on it became apparent this is what was keeping the connections open.
[sourcecode language=”text”]
>filelock/7 04.09 09:48:07.652 IdcServer-140025 Reserving /proj/wcsshare/ucm/cs/data/alerts/
>filelock/7 04.09 09:48:07.653 IdcServer-140015 Lock bounce on loop 2
>filelock/6 04.09 09:48:07.653 IdcServer-140015 /proj/wcsshare/ucm/cs/data/alerts/–<no-agent>–Next sleep interval is 720
>filelock/7 04.09 09:48:07.654 IdcServer-140025 Lock bounce on loop 0
>filelock/6 04.09 09:48:07.654 IdcServer-140025 /proj/wcsshare/ucm/cs/data/alerts/–<no-agent>–Next sleep interval is 120
>filelock/6 04.09 09:48:07.705 IdcServer-139986 /proj/wcsshare/ucm/cs/data/alerts/–<no-agent>–Locked directory
>filelock/6 04.09 09:48:07.707 IdcServer-139986 /proj/wcsshare/ucm/cs/data/alerts/–<no-agent>–Released
[/sourcecode]
As you can see in the above output there was contention between threads on the same UCM managed server (not to mention the other 3 nodes in the cluster) for locking on that directory. Since the lock attempt was happening on a filter in the service call it was keeping the database connection open until it could get the lock and release it.
We reached out to some contacts in Oracle development and it turned out that the issue was already fixed and in an available OPatch! The patch in question for PS4 is 13502977: UCM 11.1.1.5.0 patch (20111230-1105). Looking at the README reveals the following:
[sourcecode language=”text”]
2011-05-31 15:34:45 – Core – idcsvn:92934 – Bug 11781402-Moved locking of alerts directory into the if condition. This reduces the unncessary locking of the directory when there is no change in alerts.hda file.
[/sourcecode]
Moral of the story: If you are on PS4 or earlier of WebCenter Content (UCM), make sure you apply the latest OPatch with this fix. If you are on PS5 or later, the fix is already included.