Monday, March 26, 2012

F# and Windows Server AppFabric Cache

I recently started investigating Windows Server AppFabric (which is different from Windows Azure AppFabric), trying to understand how to use it, what the operational support implications are, and what its functional capabilities and limitations are. To help me get going, I have been reading the book Pro Windows Server AppFabric by Stephen Kaufman and going through the Windows Server AppFabric Training Kit. The free downloadable training kit was the more valuable of the two in helping me understand Windows Server AppFabric.

As always, the only way to truly learn a new technology is to take it out for a spin. Here are some of the PowerShell scripts that I used to create the cache and get it going:

# Import the necessary administration modules
Import-Module DistributedCacheAdministration
Import-Module DistributedCacheConfiguration

# List the available PowerShell commands
Get-Command -module DistributedCacheAdministration   
Get-Command -module DistributedCacheConfiguration


# These were run locally on one of the AppFabric cache hosts;
# with no arguments, Use-CacheCluster targets the local cluster
Use-CacheCluster

# Start the cache cluster
Start-CacheCluster

# Check to make sure it is up and running
Get-CacheHost

# Create a new cache (TimeToLive is specified in minutes)
New-Cache -CacheName MyTestCache -TimeToLive 60 -Expirable true

# Check cache is created
Get-Cache

# Grant local user access to cache
Grant-CacheAllowedClientAccount MyUserId

# Check security settings
Get-CacheAllowedClientAccounts

# After a few runs, check Cache statistics
Get-CacheStatistics -CacheName MyTestCache 

# Get Cache configuration information
Get-CacheConfig MyTestCache 


# Creating a cache with high availability (HA) - all servers need to be
# on Windows Server Enterprise Edition; any host on a lesser edition
# will not start the cache cluster
New-Cache MyNewHACache -Secondaries 1

# Modifying the cache to enable callbacks
Set-CacheConfig -CacheName MyTestCache -NotificationsEnabled true -TimeToLive 180
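
# When finished experimenting, the cache and the cluster can be torn down.
# (Illustrative additions, not part of my original run:)
# Remove-Cache MyTestCache
# Stop-CacheCluster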

For the client, I wrote some F# scripts to experiment with Windows Server AppFabric. I ran these on the local server with the loopback adapter enabled:

// I copied the libraries from c:\windows\system32\appfabric
#r @"C:\lib\Microsoft.ApplicationServer.Caching.Client.dll"
#r @"C:\lib\Microsoft.ApplicationServer.Caching.Core.dll"

open System
open System.Collections.Generic
open Microsoft.ApplicationServer.Caching

// Make sure you have granted your account access first:
// Grant-CacheAllowedClientAccount my-user-id

// Creating the DataCacheFactory is expensive; do this once on startup
let dcf = new DataCacheFactory(
            let servers = new List<DataCacheServerEndpoint>()
            new DataCacheServerEndpoint("localhost",22233) |> servers.Add
            new DataCacheFactoryConfiguration(Servers=servers))
printfn "Data Cache Factory Created!"

// Testing add/get in default region
let cache = dcf.GetCache("MyTestCache")
let retval = cache.Add("mykey","hello app fabric!") 
cache.Get("mykey")

// Create a new region in the cache; a region is pinned to a single cache node
cache.CreateRegion("stocks")

// Helper function to create DataCacheTags
let createTags tags = Seq.map (fun tag -> new DataCacheTag(tag)) tags

   
type Company =
  { Symbol : string; Name: string; Address : string;  Phone : string; Tags :seq<string> }

// Create a bunch of company info and put it in the "stocks" region
[{Symbol="AAPL"; Name="Appl Inc."; Phone="408-996-1010"; 
  Address="1 Infinite Loop, Cupertino, CA 95014";
  Tags = ["Technology";"Nasdaq";"Personal Computers"]};
 {Symbol="CAT"; Name="Caterpillar, Inc."; Phone="309-675-1000"; 
  Address="100 North East Adams Street, Peoria, IL 61629";
  Tags = ["Dow";"Industrial Goods";"Farm & Construction Machinery"]};
 {Symbol="ACI"; Name="Arch Coal, Inc."; Phone="314-994-2700"; 
  Address="Once City Place Drive, Suite 300, St. Louis, MO 63141";
  Tags = ["Basic Materials";"Industrial Metals & Materials"]};
 {Symbol="HP"; Name="Hewlett-Packard Company"; Phone="650-857-1501"; 
  Address="3000 Hanover Street, Palo Alto, Ca 94304";
  Tags = ["Dow";"Technology"; "Diversified Computer Systems"]};
 {Symbol="JPM"; Name="JP Morgan Chase & Co."; Phone="212-270-6000"; 
  Address="270 Park Avenue, New York, NY 10017";
  Tags = ["Dow";"Financial"; "Money Center Banks"]};
 {Symbol="XOM"; Name="Exxon Mobile Corporation"; Phone="972-444-1000"; 
  Address="5959 Las Colinas Boulevard, Irving, TX 75039-2298";
  Tags = ["Dow";"Basic Materials"; "Major Integrated Oil & Gas"]};]
|> Seq.iter (fun item -> cache.Put(item.Symbol, item, (createTags item.Tags),"stocks") |> ignore)


// Get all stocks in "stocks" region as a HashMap 
let stocks = cache.GetObjectsInRegion("stocks")
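
// GetObjectsInRegion yields KeyValuePair<string,obj> entries; an
// illustrative sketch enumerating them and downcasting each value:
stocks
|> Seq.iter (fun kv -> printfn "%s : %s" kv.Key ((kv.Value :?> Company).Name))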


// Bulk get - only available with named regions, not the default region
let myportfolio = ["AAPL";"CAT";"HP";"JPM"]
let mystocks = cache.BulkGet(myportfolio,"stocks")

// Getting objects from cache by tag (again, only available in a named region)
cache.GetObjectsByTag(new DataCacheTag("Dow"),"stocks")


// Getting stuff by all tags (AND filter)
let candidateTags = ["Technology";  "Personal Computers"] |> createTags
cache.GetObjectsByAllTags(candidateTags,"stocks") 

// Getting stuff by any tags (OR filter)
let interestTags = ["Technology";  "Financial"; "Basic Materials"] |> createTags
cache.GetObjectsByAnyTag(interestTags,"stocks")   

// Concurrency policy is strictly done by the client
// No explicit cache server control on concurrency

// Optimistic concurrency example
cache.Remove("mykey")
let version = cache.Add("mykey","mystuff")

// Pretend another thread snuck in and modified this cache item.
cache.Put("mykey","changed value")

// This Put will detect that someone else modified the item and throw an exception
cache.Put("mykey","another value",version)

// Pessimistic locking - locks across all cache nodes
let mutable lockHandle:DataCacheLockHandle = null
let item = cache.GetAndLock("mykey",TimeSpan.FromMinutes(60.0), &lockHandle,true)   


// A second GetAndLock with a new handle will throw an exception
// because the item is already locked by the first lockHandle
let mutable lockHandle2:DataCacheLockHandle = null
let item2 = cache.GetAndLock("mykey",TimeSpan.FromMinutes(60.0),&lockHandle2, true)   

// This will totally ignore locking policies and overwrite things
// Locking is controlled by diligence on the client side.
cache.Put("mykey","no enforced concurrency on the server")


// Finally, this will update the data and unlock this item in the cache
// Need to redo the GetAndLock for this to work.  Previous statement would wipe out the lock
cache.PutAndUnlock("mykey","new stuff",lockHandle)
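
// Alternatively (illustrative only), if no update is needed, Unlock
// releases the lock without writing a new value:
// cache.Unlock("mykey", lockHandle)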


// Working with Callbacks
(DataCacheOperations.AddItem,
 fun cacheName regionName key version cacheOperation nd -> 
   printfn "Item added to cache : %s\n" key)
|> cache.AddCacheLevelCallback
|> ignore
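
// AddCacheLevelCallback returns a DataCacheNotificationDescriptor; a
// sketch (added for illustration): keep the descriptor instead of
// ignoring it if the callback needs to be unregistered later
let removeDescriptor =
    cache.AddCacheLevelCallback(
        DataCacheOperations.RemoveItem,
        fun cacheName regionName key version cacheOperation nd ->
            printfn "Item removed from cache : %s" key)
// cache.RemoveCacheLevelCallback(removeDescriptor)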


// When this is called, it could be seconds or minutes before anything
// is printed to the console; the client receives notifications by
// polling the cluster at a configurable interval
cache.Add("newkey","new stuff")















Monday, March 12, 2012

F# and SharePoint 2010

I bought a bunch of SharePoint 2010 books and hadn't had the time to go through them. Recently, I finally found some time to take a look at those books and try things out in F#. Trying out examples with SharePoint 2007 was fairly painless, since I got to do everything in a 32-bit environment. Getting Visual Studio 2010 F# to work with SharePoint 2010 was more challenging. While I was able to get compiled F# code to work, I couldn't overcome the hurdles to get SharePoint 2010 to work with F# Interactive, which is really my preferred way to investigate the SharePoint server object model. I can tell from this Stack Overflow entry that I was not alone in having trouble getting F# Interactive to work with the SharePoint API. The combination of the .NET 3.5 framework and 64-bit defeated my efforts. I tried Igor Dvorkin's steps to run F# Interactive in 64-bit, then took the command-line arguments for the F# compiler and applied them to F# Interactive where applicable. So I would start up F# Interactive with the following command-line parameters:

"C:\Program Files (x86)\Microsoft F#\v4.0\fsi.exe" ^
  --debug:full ^
  --noframework ^
  --define:DEBUG ^
  --define:TRACE ^
  --optimize- ^
  --tailcalls- ^
  -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\FSharp\2.0\Runtime\v2.0\FSharp.Core.dll" ^
  -r:"C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\ISAPI\Microsoft.SharePoint.dll" ^
  -r:C:\Windows\Microsoft.NET\Framework\v2.0.50727\mscorlib.dll ^
  -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\v3.5\System.Core.dll" ^
  -r:C:\Windows\Microsoft.NET\Framework\v2.0.50727\System.dll ^
  -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\v3.0\System.ServiceModel.dll" 

When I tried to run with the above arguments, I got the following error:

error FS0084: Assembly reference 
'C:\Program Files (x86)\Reference Assemblies\Microsoft\FSharp\2.0\Runtime\v2.0\FSharp.Compiler.Interactive.Settings.dll' 
was not found or is invalid 

So something in FSharp.Compiler.Interactive.Settings.dll is not working with the .NET 3.5 framework. I wish there were an option in F# Interactive to target a specific .NET runtime (e.g. --target:3.5).

The good news is that the F# compiler works fine as long as you make sure you are targeting the .NET 3.5 runtime and the x64 CPU. Here's the F# version of a code sample that I pulled from the book Inside Microsoft SharePoint 2010 by Ted Pattison, Andrew Connell, Scot Hillier, and David Mann:

open System
open Microsoft.SharePoint

let siteUrl = "http://localhost/"
let sites = new SPSite(siteUrl)

let site = sites.RootWeb
seq {for i in 0..(site.Lists.Count-1) do
       if site.Lists.[i].Hidden <> true then
         yield (site.Lists.[i]) }
|> Seq.iter (fun item -> printf "%s\n" item.Title)
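
One caveat worth adding (an aside of mine, not from the book sample): SPSite holds unmanaged resources and implements IDisposable, so outside of quick exploratory scripts it should be bound with use. A minimal sketch, with printVisibleListTitles as a hypothetical wrapper around the same logic:

// Sketch: the listing logic above, with the SPSite disposed deterministically
let printVisibleListTitles (url:string) =
    use sites = new SPSite(url)
    let site = sites.RootWeb
    seq { for i in 0 .. site.Lists.Count - 1 do
            if not site.Lists.[i].Hidden then
              yield site.Lists.[i] }
    |> Seq.iter (fun item -> printf "%s\n" item.Title)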

For those who have read my past posts on F# and SharePoint, and in particular my post Revisiting the SharePoint collection adapter for F#, I am sorry to say that I was led astray by C# example code and my brain temporarily malfunctioned as it produced the SharePoint adapter nonsense in that post. There is no need to write a SharePoint utility library in C# to let F# work with SPListCollection as a sequence. In the example above, I used a sequence expression to generate a sequence of SPList items from the SPListCollection. Another way is to simply create a sequence by mapping over the collection, as shown in the following example:

// Create a sequence by mapping over all items in the collection
let spcollection = site.Lists
[0..spcollection.Count-1] 
|> Seq.map (fun i -> spcollection.[i])
|> Seq.iter (fun item -> printf "%s\n" item.Title)
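
Simpler still, SPListCollection implements the non-generic IEnumerable, so Seq.cast can produce a typed sequence directly; this third option is probably the route I would take today:

// Cast the collection straight into a typed sequence
site.Lists
|> Seq.cast<SPList>
|> Seq.iter (fun item -> printf "%s\n" item.Title)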

There was no need to leave the confines of F#, and these approaches are also type safe, unlike my previous clumsy attempt at converting SPListCollection into a sequence. If you really do want to use an adapter, it can also be implemented in F# as follows:

// SPCollectionAdapter written in F#
let fromSPCollection<'a,'b when 'b :> SPBaseCollection> (collection:'b) =
    let enumerator = collection.GetEnumerator()
    // Walk the non-generic IEnumerator, downcasting each item to 'a
    let rec enumerate (e:Collections.IEnumerator) acc =
        if not (e.MoveNext()) then
            acc
        else
            enumerate e ((e.Current :?> 'a) :: acc)
    enumerate enumerator []

// Example usages
fromSPCollection<SPList,SPListCollection> site.Lists
|> Seq.iter (fun item  -> printf "%s\n" item.Title)
    

// webapp is an instance of SPWebApplication object
fromSPCollection<SPContentDatabase,SPContentDatabaseCollection>  webapp.ContentDatabases
|> Seq.iter (fun db -> 
              printf "Content Database : %s\n" db.Name
              printf "Connection String : %s\n" db.DatabaseConnectionString)
    

Tuesday, March 06, 2012

Adventures in troubleshooting out of memory errors with a Coherence cluster.

One day, an application team manager called me and said that their application had caused an out of memory condition in their Oracle Coherence cluster. The same code base had run in the old Coherence 3.1 environment for months without running into out of memory conditions, but it was now failing in the new Coherence 3.6 environment within a few weeks, on a regular basis. He said that he had heap dumps and logs and asked whether I could take a look and troubleshoot it.

Initially, I was skeptical about being able to help. After all, I knew almost nothing about their application code and, for all practical purposes, I had no previous development experience with Coherence, except for having read the book Oracle Coherence 3.5 by Aleksandar Seovic. I had previously participated in testing Coherence performance on VMware, but that did not require me to delve into the Coherence API at all.

Despite these misgivings, I decided to pitch in and told the application team manager that I would try my best.

The system with problems was a multi-node Coherence cluster. When I took a look at the logs, all of them contained verbose GC output similar to this:

[GC [1 CMS-initial-mark: 1966076K(1966080K)] 2083794K(2084096K), 0.1923110 secs] [Times: user=0.18 sys=0.00, real=0.19 secs] 
[Full GC [CMS[CMS-concurrent-mark: 1.624/1.626 secs] [Times: user=3.22 sys=0.00, real=1.62 secs] 
 (concurrent mode failure): 1966079K->1966078K(1966080K), 6.6177340 secs] 2084093K->2084082K(2084096K), [CMS Perm : 13617K->13617K(23612K)], 6.6177900 secs] [Times: user=8.21 sys=0.00, real=6.62 secs] 
[Full GC [CMS: 1966078K->1966078K(1966080K), 4.1110330 secs] 2084093K->2084089K(2084096K), [CMS Perm : 13617K->13615K(23612K)], 4.1111070 secs] [Times: user=4.11 sys=0.00, real=4.11 secs] 
[Full GC [CMS: 1966078K->1966078K(1966080K), 4.2973090 secs] 2084092K->2084087K(2084096K), [CMS Perm : 13615K->13615K(23612K)], 4.2973630 secs] [Times: user=4.28 sys=0.00, real=4.30 secs] 
[Full GC [CMS: 1966078K->1966078K(1966080K), 4.1831450 secs] 2084093K->2084093K(2084096K), [CMS Perm : 13615K->13615K(23612K)], 4.1831970 secs] [Times: user=4.18 sys=0.00, real=4.18 secs] 
[Full GC [CMS: 1966078K->1966078K(1966080K), 4.2524850 secs] 2084093K->2084093K(2084096K), [CMS Perm : 13615K->13615K(23612K)], 4.2525380 secs] [Times: user=4.24 sys=0.00, real=4.25 secs] 
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid23607.hprof ...
Heap dump file created [2274434953 bytes in 28.968 secs]

This garbage collection log output told me that they were using the CMS garbage collector. The concurrent mode failure entries certainly grabbed my attention. Normally one would fix concurrent mode failures by tuning the CMS initiating occupancy fraction via the -XX:CMSInitiatingOccupancyFraction flag, but in this case, the heap numbers in the lines labeled "Full GC" showed that GC could not reclaim any memory at all, so this problem could not be solved by GC tuning. By the way, for a great book on tuning garbage collection, I recommend Charlie Hunt's book Java Performance.
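
For reference, that kind of CMS tuning would normally involve JVM flags along these lines (illustrative values only, and again, not the fix in this case):

java -Xms2048m -Xmx2048m -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly ...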

My next step was to take a look at the heap. The heap dump was slightly over 2 GB, which was expected since each Coherence cluster node was configured with a 2 GB heap. That presented a problem for me, because I still mainly work on a 32-bit Windows laptop; I needed a 64-bit system with preferably 4 GB of RAM or more. Once I got such a machine, I fired up the Eclipse Memory Analyzer Tool (MAT), and the biggest memory offender was immediately obvious: a top-level hash map chewing up 1.6 GB of memory. Delving further into that structure revealed that the Coherence caching structure is a hash of hashes. Looking at the hashes, I noticed there were over 2,000 entries in the top-level hash, which implied there were over 2,000 caches in the Coherence cluster. Studying the individual caches, I noticed cache names like

  • alpha-FEB-21
  • alpha-FEB-22
  • alpha-FEB-23
  • alpha-FEB-24
  • beta-FEB-21
  • beta-FEB-22
  • beta-FEB-23
  • beta-FEB-24

and so forth. I asked the application team manager whether he expected to have this many caches in the cluster. He said no; he expected a much smaller set of caches, since the application normally destroys caches older than two days. The developers provided me their code related to the creation and destruction of caches, and the relevant lines seemed pretty innocuous:

    public static void destroyCache(String name) {
        Collection listOfCacheNames = getListOfCacheNames(name, false);
        Iterator iterator = listOfCacheNames.iterator();
        while (iterator.hasNext()) {
            String cacheName = (String) iterator.next();
            NamedCache namedCache = CacheFactory.getCache(cacheName);
            namedCache.destroy();
        }
    }

I went back to the memory analyzer tool, performed a GC-roots analysis, and saw that the top-level object holding onto this heap was:

com.tangosol.coherence.component.net.Cluster$IpMonitor @ 0x77ff4b18 

with the label "Busy Monitor" next to it. This line item seems to suggest that there's a monitor lock on this cache. Looking at the Coherence API documentation, I see the following entry:


destroy

void destroy()
Release and destroy this instance of NamedCache.

Warning: This method is used to completely destroy the specified cache across the cluster. All references in the entire cluster to this cache will be invalidated, the cached data will be cleared, and all internal resources will be released.

Caches should be destroyed by the same mechanism in which they were obtained. For example:

  • new Cache() - cache.destroy()
  • CacheFactory.getCache() - CacheFactory.destroyCache()
  • ConfigurableCacheFactory.ensureCache() - ConfigurableCacheFactory.destroyCache()
Except for the case where the application code explicitly allocated the cache, this method should not be called by application code.

Looking at this documentation, we initially thought that since the cache was obtained via CacheFactory, it should also be destroyed via CacheFactory, and that CacheFactory therefore held a monitor lock on the underlying collections. The code provided by the developers used one mechanism to create the cache and another to destroy it, so we presumed that was the problem. I implemented a test script to test that theory and, surprisingly, even when destroying via CacheFactory I still encountered out of memory issues. Only by clearing the cache before destroying it was I able to avoid out of memory errors. Here's the script that I developed in Clojure to test my theories:

(import '(org.apache.commons.lang3 RandomStringUtils) 
        '(java.math BigInteger)
        '(java.util Random Date HashMap)
        '(com.tangosol.net NamedCache CacheFactory CacheService Cluster))

(defn random-text [] (RandomStringUtils/randomAlphanumeric 1048576))
(defn random-key [] (RandomStringUtils/randomAlphanumeric 12))
        
(CacheFactory/ensureCluster)
(def buffer (new HashMap))

(defn print-horizontal-line [c] (println (apply str (repeat 80 c))))

(def caches '("alpha" "beta" "gamma" "delta" 
              "epsilon" "zeta" "eta" "theta" 
              "iota" "kappa" "lambda" "mu" "nu"
              "xi", "omicron" "pi" "rho"
              "signma" "tau" "upsilon" "phi", 
              "chi" "psi" "omega"))

(defn load-cache [cache-name n]
    (let [cache (CacheFactory/getCache cache-name)]
         (print-horizontal-line  "=")
         (println "Creating cache : " cache-name)
         (print-horizontal-line  "=")         
         (.clear buffer)
         (dotimes [_ n] (.put buffer (random-key) (random-text)))
         (.putAll cache buffer)))


(defn recreate-oom-problem [cache-name]
    (let [cache (CacheFactory/getCache cache-name)]
         (load-cache cache-name 200)
         (.destroy cache)))
         
(defn try-fix-oom-1 [cache-name]
    (let [cache (CacheFactory/getCache cache-name)]
         (load-cache cache-name 200)
         (CacheFactory/destroyCache cache)))

(defn try-fix-oom-2 [cache-name]
    (let [cache (CacheFactory/getCache cache-name)]
         (load-cache cache-name 200)
         (.clear cache)
         (CacheFactory/destroyCache cache)))
         
; Test run to recreate the original problem.  Was able to reproduce OOM issues
(doseq [cache caches] (recreate-oom-problem cache))

; Surprise! Still have OOM issues
(doseq [cache caches] (try-fix-oom-1 cache))

; No longer have OOM issues, but memory is still leaking (slowly)
(doseq [cache caches] (try-fix-oom-2 cache))
         

However, I still suspected a memory leak; it was just a lot smaller now. To verify it, I would run my Clojure test script and deliberately create and fill a cache without clearing it. I then forced a full garbage collection followed by a heap dump. In the memory analyzer tool, I would look up the cache that I did not clear and list all of its incoming references. Then I would look for a HashMap in the incoming references, select one, and check its outgoing references. In those outgoing references, I could see that the key contained the name of a cache on which I had called CacheFactory.destroyCache(), with retained heap sizes ranging anywhere from 24 to 160 bytes, seemingly proportional to the length of the cache name.

In conclusion, it would seem Oracle Coherence 3.6 does have a memory leak issue in the cache creation and destruction process. If we clear the cache before destroying it, I suspect it would be a long time before the memory leak is even noticeable by this particular application.
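
Applied to the team's original code, the workaround would look something like the following sketch (assuming the same getListOfCacheNames helper as before):

    public static void destroyCache(String name) {
        Collection listOfCacheNames = getListOfCacheNames(name, false);
        Iterator iterator = listOfCacheNames.iterator();
        while (iterator.hasNext()) {
            String cacheName = (String) iterator.next();
            NamedCache namedCache = CacheFactory.getCache(cacheName);
            namedCache.clear();                    // clear first to work around the leak
            CacheFactory.destroyCache(namedCache); // destroy via the same factory
        }
    }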

To verify that this leak did not exist in the older version, we ran the same test code against Coherence 3.1 and were unable to reproduce the out of memory errors. We also tested against Oracle Coherence 3.7.1 and again could not reproduce the error. So it looks like this memory error is specific to Oracle Coherence 3.6.

Throughout this entire process, the secret sauce that enabled me to quickly learn Coherence and to reproduce and troubleshoot the out of memory problem was Clojure. Clojure allowed me to interactively manipulate Coherence clusters and explore the API, which would have been a lot slower if I had to go through the normal edit-compile-run cycle with plain old Java.