Tuesday, March 06, 2012

Adventures in troubleshooting out of memory errors with a Coherence cluster.

One day, an application team manager called me and said that their application was causing out of memory errors in their Oracle Coherence cluster. The same code base had run in the old Coherence 3.1 environment for months without running out of memory, but in the new Coherence 3.6 environment it was now failing regularly, within a matter of a few weeks. He said that he had heap dumps and logs and asked whether I could take a look at them and troubleshoot the problem.

Initially, I was skeptical about being able to help this team manager out. After all, I knew almost nothing about their application code and, for all practical purposes, I had no previous development experience with Coherence, apart from having read the book Oracle Coherence 3.5 by Aleksandar Seovic. I had previously participated in testing Coherence performance on VMware, but that did not require me to delve into the Coherence API at all.

Despite these misgivings, I decided to provide my support and told the application team manager that I'll try my best.

The system with problems was a multi-node Coherence cluster. When I took a look at the logs, all of them had similar verbose GC output:

[GC [1 CMS-initial-mark: 1966076K(1966080K)] 2083794K(2084096K), 0.1923110 secs] [Times: user=0.18 sys=0.00, real=0.19 secs] 
[Full GC [CMS[CMS-concurrent-mark: 1.624/1.626 secs] [Times: user=3.22 sys=0.00, real=1.62 secs] 
 (concurrent mode failure): 1966079K->1966078K(1966080K), 6.6177340 secs] 2084093K->2084082K(2084096K), [CMS Perm : 13617K->13617K(23612K)], 6.6177900 secs] [Times: user=8.21 sys=0.00, real=6.62 secs] 
[Full GC [CMS: 1966078K->1966078K(1966080K), 4.1110330 secs] 2084093K->2084089K(2084096K), [CMS Perm : 13617K->13615K(23612K)], 4.1111070 secs] [Times: user=4.11 sys=0.00, real=4.11 secs] 
[Full GC [CMS: 1966078K->1966078K(1966080K), 4.2973090 secs] 2084092K->2084087K(2084096K), [CMS Perm : 13615K->13615K(23612K)], 4.2973630 secs] [Times: user=4.28 sys=0.00, real=4.30 secs] 
[Full GC [CMS: 1966078K->1966078K(1966080K), 4.1831450 secs] 2084093K->2084093K(2084096K), [CMS Perm : 13615K->13615K(23612K)], 4.1831970 secs] [Times: user=4.18 sys=0.00, real=4.18 secs] 
[Full GC [CMS: 1966078K->1966078K(1966080K), 4.2524850 secs] 2084093K->2084093K(2084096K), [CMS Perm : 13615K->13615K(23612K)], 4.2525380 secs] [Times: user=4.24 sys=0.00, real=4.25 secs] 
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid23607.hprof ...
Heap dump file created [2274434953 bytes in 28.968 secs]

This garbage collection log output tells me that they are using CMS for the JVM GC. The concurrent mode failure entries certainly grabbed my attention. Normally one would fix concurrent mode failures by tuning the CMS initiation occupancy fraction via the -XX:CMSInitiatingOccupancyFraction flag, but in this case, the heap numbers in the lines labeled "Full GC" showed that GC could not reclaim any memory at all: the old generation stayed at 1966078K of 1966080K after every collection. So this problem could not be solved by GC tuning. By the way, for a great book on tuning garbage collection, I would recommend Charlie Hunt's book Java Performance.
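The "nothing was reclaimed" reading of the Full GC lines can also be checked mechanically. The following is a minimal sketch, not part of the original investigation: the class and method names (GcLogCheck, parseUsages, reclaimedKb) are made up for illustration, and it assumes the CMS log layout shown above, where the first before->after(capacity) triple on a Full GC line is the old generation and the second is the whole heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any GC tooling.
class GcLogCheck {
    // Positions of the triples on a CMS "Full GC" line (assumed layout).
    static final int OLD_GEN = 0;
    static final int WHOLE_HEAP = 1;

    // Matches "<before>K-><after>K(<capacity>K)" in verbose GC output.
    private static final Pattern USAGE =
            Pattern.compile("(\\d+)K->(\\d+)K\\((\\d+)K\\)");

    /** Returns {before, after, capacity} in KB for each triple on the line, in order. */
    static List<long[]> parseUsages(String line) {
        List<long[]> usages = new ArrayList<>();
        Matcher m = USAGE.matcher(line);
        while (m.find()) {
            usages.add(new long[] {
                Long.parseLong(m.group(1)),
                Long.parseLong(m.group(2)),
                Long.parseLong(m.group(3))
            });
        }
        return usages;
    }

    /** KB freed by the collection for the triple at the given position. */
    static long reclaimedKb(String line, int index) {
        long[] usage = parseUsages(line).get(index);
        return usage[0] - usage[1];
    }

    public static void main(String[] args) {
        // One of the Full GC lines from the log above.
        String line = "[Full GC [CMS: 1966078K->1966078K(1966080K), 4.1110330 secs] "
                + "2084093K->2084089K(2084096K), "
                + "[CMS Perm : 13617K->13615K(23612K)], 4.1111070 secs]";
        System.out.println("old gen reclaimed KB:    " + reclaimedKb(line, OLD_GEN));
        System.out.println("whole heap reclaimed KB: " + reclaimedKb(line, WHOLE_HEAP));
    }
}
```

For the sample line, the old generation frees 0 KB and the whole heap frees 4 KB out of roughly 2 GB, which is why raising or lowering the CMS initiating occupancy would not have helped here.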

My next step was to take a look at the heap dump. It was slightly over 2 GB, which was expected since each Coherence cluster node was configured with a 2 GB heap. That presented a problem for me because I was still mainly working on a 32-bit Windows laptop; I needed to find a 64-bit system with preferably 4 GB of RAM or more to open it. Once I was able to get such a machine, I fired up the Eclipse Memory Analyzer Tool (MAT). Once I looked at the heap, the biggest memory offender was pretty obvious: a top-level hash map chewing up 1.6 GB of memory. Delving further into that hash map revealed that the Coherence caching structure is a hash of hashes. There were more than 2,000 items in the top-level hash, which implied that there were more than 2,000 caches in the Coherence cluster. Studying the individual caches, I noticed cache names like

  • alpha-FEB-21
  • alpha-FEB-22
  • alpha-FEB-23
  • alpha-FEB-24
  • beta-FEB-21
  • beta-FEB-22
  • beta-FEB-23
  • beta-FEB-24

and so forth. I asked the application team manager whether he expected to have this many caches in the cluster. He said no; he expected a much smaller set of caches, because the application normally destroys caches older than 2 days. The developers provided me with their code related to the creation and destruction of caches. I saw the following lines, which seemed pretty innocuous:

    public static void destroyCache(String name) {
        // getListOfCacheNames is the application's own helper (not shown here).
        Collection listOfCacheNames = getListOfCacheNames(name, false);
        Iterator iterator = listOfCacheNames.iterator();
        while (iterator.hasNext()) {
            String cacheName = (String) iterator.next();
            NamedCache namedCache = CacheFactory.getCache(cacheName);
            // Destroys the cache through the NamedCache instance itself.
            namedCache.destroy();
        }
    }

I went back to the memory analyzer tool, performed a path-to-GC-roots analysis, and saw that the top-level object holding onto this heap was:

com.tangosol.coherence.component.net.Cluster$IpMonitor @ 0x77ff4b18 

with the label "Busy Monitor" next to it. This line item seems to suggest that there's a monitor lock on this cache. Looking at the Coherence API documentation, I see the following entry:


destroy

void destroy()
Release and destroy this instance of NamedCache.

Warning: This method is used to completely destroy the specified cache across the cluster. All references in the entire cluster to this cache will be invalidated, the cached data will be cleared, and all internal resources will be released.

Caches should be destroyed by the same mechanism in which they were obtained. For example:

  • new Cache() - cache.destroy()
  • CacheFactory.getCache() - CacheFactory.destroyCache()
  • ConfigurableCacheFactory.ensureCache() - ConfigurableCacheFactory.destroyCache()
Except for the case where the application code explicitly allocated the cache, this method should not be called by application code.

Looking at this documentation, we initially reasoned that since the cache was obtained via CacheFactory, it should also be destroyed via CacheFactory, and that CacheFactory therefore held a monitor lock on the underlying collections. The code provided by the developers used one mechanism to create the cache and another to destroy it, so we presumed that was the problem. I implemented a test script to check that theory and, surprisingly, even when destroying via CacheFactory, I still encountered out of memory errors. Only by clearing the cache before destroying it was I able to avoid them. Here's the script that I developed in Clojure to test my theories:

(import '(org.apache.commons.lang3 RandomStringUtils) 
        '(java.math BigInteger)
        '(java.util Random Date HashMap)
        '(com.tangosol.net NamedCache CacheFactory CacheService Cluster))

(defn random-text [] (RandomStringUtils/randomAlphanumeric 1048576))
(defn random-key [] (RandomStringUtils/randomAlphanumeric 12))
        
(CacheFactory/ensureCluster)
(def buffer (new HashMap))

(defn print-horizontal-line [c] (println (apply str (repeat 80 c))))

(def caches '("alpha" "beta" "gamma" "delta" 
              "epsilon" "zeta" "eta" "theta" 
              "iota" "kappa" "lambda" "mu" "nu"
              "xi", "omicron" "pi" "rho"
              "sigma" "tau" "upsilon" "phi", 
              "chi" "psi" "omega"))

(defn load-cache [cache-name n]
    (let [cache (CacheFactory/getCache cache-name)]
         (print-horizontal-line  "=")
         (println "Creating cache : " cache-name)
         (print-horizontal-line  "=")         
         (.clear buffer)
         (dotimes [_ n] (.put buffer (random-key) (random-text)))
         (.putAll cache buffer)))


(defn recreate-oom-problem [cache-name]
    (let [cache (CacheFactory/getCache cache-name)]
         (load-cache cache-name 200)
         (.destroy cache)))
         
(defn try-fix-oom-1 [cache-name]
    (let [cache (CacheFactory/getCache cache-name)]
         (load-cache cache-name 200)
         (CacheFactory/destroyCache cache)))

(defn try-fix-oom-2 [cache-name]
    (let [cache (CacheFactory/getCache cache-name)]
         (load-cache cache-name 200)
         (.clear cache)
         (CacheFactory/destroyCache cache)))
         
; Test run to recreate the original problem. Was able to reproduce OOM issues
(doseq [cache caches] (recreate-oom-problem cache))

; Surprise! Still have OOM issues
(doseq [cache caches] (try-fix-oom-1 cache))

; No longer have OOM issues, but memory is still leaking (slowly)
(doseq [cache caches] (try-fix-oom-2 cache))
         

However, I still suspected a memory leak; it was just a lot smaller now. To verify it, I ran my Clojure test script and then deliberately created and filled one cache without clearing it. I then forced a full garbage collection followed by a heap dump. In the memory analyzer tool, I looked up the cache that I did not clear and listed all of its incoming references. I then picked a HashMap among those incoming references and checked its outgoing references. There I could see keys containing the names of caches that I had called CacheFactory.destroyCache() on, with retained heap sizes ranging anywhere from 24 to 160 that seemed proportional to the length of the cache name.
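The "force a full GC, then dump the heap" step can be scripted from inside a JVM. The post does not say how the dump was actually taken, so the following is only a sketch under stated assumptions: it uses the HotSpot-specific HotSpotDiagnosticMXBean, it dumps the JVM it runs in (so it would have to run inside the cluster node being inspected), and the class name HeapDumper and the file name leak-check.hprof are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; requires a HotSpot-based JVM.
class HeapDumper {
    /**
     * Requests a GC, then writes a dump of live objects to a new
     * .hprof file in a fresh temporary directory and returns its path.
     */
    static Path dumpLiveHeapToTemp() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            Path target = dir.resolve("leak-check.hprof"); // file must not exist yet
            System.gc(); // only a request; the JVM may ignore it
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // Second argument true = dump only live (reachable) objects.
            bean.dumpHeap(target.toString(), true);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dump = dumpLiveHeapToTemp();
        System.out.println("Wrote " + dump.toFile().length() + " bytes to " + dump);
    }
}
```

The resulting .hprof file can then be opened in MAT in the same way as the dump produced by the OutOfMemoryError.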

In conclusion, it would seem that Oracle Coherence does have a memory leak in the cache creation and destruction process. If we clear the cache before destroying it, I suspect it would be a long time before the leak is even noticeable by this particular application.
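In Java, the clear-before-destroy workaround might look roughly like the sketch below. To keep it runnable without a Coherence installation, it uses a hypothetical DestroyableCache interface as a stand-in for the two operations involved; in real Coherence code these would correspond to NamedCache.clear() and CacheFactory.destroyCache(cache). CacheCleanup and RecordingCache are likewise illustrative names, and the try/finally arrangement is a defensive choice of this sketch, not something the application code did.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these types are Coherence classes.
class CacheCleanup {
    /** Hypothetical stand-in for the cache operations used by the workaround. */
    interface DestroyableCache {
        void clear();    // real code: namedCache.clear()
        void destroy();  // real code: CacheFactory.destroyCache(namedCache)
    }

    /** Clears, then destroys, every cache; returns how many caches hit an error. */
    static int clearAndDestroyAll(List<? extends DestroyableCache> caches) {
        int failures = 0;
        for (DestroyableCache cache : caches) {
            try {
                try {
                    cache.clear();   // release the entries first
                } finally {
                    cache.destroy(); // still attempted if clear() throws
                }
            } catch (RuntimeException e) {
                failures++;          // keep going with the remaining caches
                System.err.println("Cleanup failed: " + e.getMessage());
            }
        }
        return failures;
    }

    /** In-memory fake that records calls, used only to exercise the sketch. */
    static class RecordingCache implements DestroyableCache {
        private final String name;
        private final List<String> log;
        private final boolean failOnClear;

        RecordingCache(String name, List<String> log, boolean failOnClear) {
            this.name = name;
            this.log = log;
            this.failOnClear = failOnClear;
        }

        @Override public void clear() {
            log.add("clear " + name);
            if (failOnClear) throw new IllegalStateException("clear failed for " + name);
        }

        @Override public void destroy() {
            log.add("destroy " + name);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<RecordingCache> caches = List.of(
                new RecordingCache("alpha-FEB-21", log, false),
                new RecordingCache("beta-FEB-21", log, true));
        System.out.println("failures: " + clearAndDestroyAll(caches));
        System.out.println(log);
    }
}
```

Putting destroy() in a finally block means a failed clear() cannot silently skip the destroy, and catching the exception per cache keeps one bad cache from stopping the rest of the cleanup.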

To verify that this leak did not exist in the older 3.1 version, we ran this test code against it and were unable to reproduce the out of memory errors. We also tested against Oracle Coherence 3.7.1 and were unable to reproduce the out of memory error. So it looks like this memory error is specific to Oracle Coherence 3.6.

Throughout this entire process, I thought that the secret sauce that enabled me to quickly learn Coherence, reproduce and troubleshoot the Coherence out of memory problem was Clojure. Clojure allowed me to interactively manipulate Coherence clusters and explore the API, which would have been a lot slower if I had to go through the normal edit-compile-run cycle with plain old Java.

2 comments:

Markus Kohler said...

Hi,
Nice example
Calling destroy outside of a finally block is dangerous.
It could happen that you get an exception before destroy is executed and then this exception would let the method return.

Of course this might really be a bug in Coherence.
Regards,
Markus

John Liao said...

Hi Markus,
Thanks for pointing out that destroy is not done in a finally block. From a defensive coding perspective, that is definitely the preferred pattern. In this situation, unless the application swallowed the exceptions, there were no exceptions logged so the problem is most likely with Coherence. The Clojure test code also supports that hypothesis.