Monday, June 11, 2012

Quick and easy way to monitor for memory leaks in Java application servers

I was part of a postmortem analysis team that performed a root cause analysis on why a particular node in a HA paired system failed and needed to provide recommendations to prevent future occurrences. Initial examination of the log showed that the node that failed had a permgen out of memory problem. Management team also wanted to know if there were memory leaks with the application.

Unfortunately, the Java application server was not configured with GC logging nor configured to perform a heap dump on out of memory error condition. Without the gc log data or the heap dump, it becomes harder to be able to explain why did the application server run out of permgen space at the time it failed. In the interest of collecting data and to prevent future node failures, we recommended that permgen space be increased and add the following startup parameters to the Java application server


One problem that arose during the writeup of the analysis was how to answer the question whether the application had any memory leak issue. The application team could not reproduce the problem in non-production environment but wanted some assurance that the application did not have a memory leak problem with PermGen space or anywhere else in the heap. I proposed that we track heap usage after every Full GC over a period of time and see the heap usage trends over time. The heap usage trends should provide a good indication if the application has a problematic memory leak issue or not.

In order to do so, I wrote a small script to parse out only the heap and permgen space info after a full GC from the gc log. I wrote this script in Clojure as follows:

; Define the regex pattern to parse each line in gc log
(defn full-gc-pattern []
    (let [timestamp  "([\\d\\.]+): .* " 
          space      "(\\d+)K->(\\d+)K\\((\\d+)K.+ "
          new-gen    space
          old-gen    space
          perm-gen   space
          heap       "(\\d+)K->(\\d+)K\\((\\d+)K\\)\\], "
          exec-stat  "(\\d+\\.\\d+).*" ]
         (re-pattern (str timestamp new-gen old-gen perm-gen heap exec-stat))))                 

; We only want to dump timestamp, total heap and perm gen space
; after each full GC
; Variable definitions:
;     ts - timestamp (in seconds)
;     ys - YoungGen space starting heap size (in KB)
;     ye - YoungGen space ending heap size (in KB)
;     ym - YoungGen space max heap size (in KB)
;     os - OldGen space starting heap size (in KB)
;     oe - OldGen space ending heap size (in KB)
;     om - OldGen space max heap size (in KB)
;     hs - Total heap space starting heap size (in KB)
;     he - Total heap space ending heap size (in KB)
;     hm - Total heap space max heap size (in KB)
;     ps - PermGen space starting heap size (in KB)
;     pe - PermGen space ending heap size (in KB)
;     pm - PermGen space max heap size (in KB)
(defn process-full-gc [entry]
    (let [parsed (first (re-seq (full-gc-pattern) entry))
         [a ts ys ye ym os oe om hs he hm ps pe pm & e] parsed]
        (println (format "%s,%s,%s" ts he pe))))

; Load the gc log data
(def gcdata (line-seq ( ( "gc.log"))))
; Process each full GC entry after filtering out minor GC entries          
(doseq [line (keep #(re-find #".*Full GC.*" %) gcdata)]
    (process-full-gc line))

Using this script against the gc log file, and taking that output and loading into a spreadsheet, I can plot and generate a trendline to see if memory is growing or not. Here's the heap usage plot over approximately a month.

As seen on the chart, the slope of the trendlines are small and negative, which is a good indicator that we do not have a memory leak problem with this application.