Friday, November 11, 2011

Revisiting the SharePoint collection adapter for F#

Recently, I got to work with SharePoint 2007 again, and I revisited a previously written blog post, Exploring SharePoint 2007 Object Model with F#. In that post, I complained that SharePoint collections did not implement the generic IEnumerable&lt;T&gt;, which forced me to implement a type-specific collection adapter pilfered from Asfar Sadewa. However, as I started to work with more of the different SharePoint collections, I realized that I needed a generic SharePoint collection adapter.

But once again, I was foiled by SharePoint's library designers. I wanted to ensure some level of type safety for my generic collection adapter, so I wanted to bound the collection item's generic type parameter so that client code is restricted to types such as SPWeb, SPList, etc. Looking up the hierarchy, I see the type SPSecurableObject. Unfortunately, the SharePoint library designers did not make SPSecurableObject visible to others, and the next level up the hierarchy is the Object type. So I'm stuck with an unbounded generic type for the collection item in my generic SharePoint collection adapter.

What I really want is to ensure that the collection item type is consistent with the collection type, and I haven't quite figured out whether that kind of type parameter constraint can even be expressed in C#. Here is the generic version of the SharePoint collection adapter (with its runtime type error leakage):

using System;
using System.Collections.Generic;
using Microsoft.SharePoint;

namespace SharePoint.Utility
{
    // I really wanted to constrain TPart to SPSecurableObject
    public class SPCollectionAdapter<TPart, TCollection> : List<TPart> where TCollection : SPBaseCollection
    {
        private TCollection _listCol;

        public SPCollectionAdapter(TCollection listCol)
        {
            _listCol = listCol;
            Refresh();
        }

        private void Refresh()
        {
            this.Clear();
            this.Capacity = _listCol.Count;

            foreach (TPart item in _listCol)
            {
                this.Add(item);
            }
        }

    }
}

This allows me to write my F# code as follows:

open Microsoft.SharePoint
open SharePoint.Utility

let path="http://localhost/"
let collection = new SPSite(path)
let site = collection.RootWeb
let lists = SPCollectionAdapter<SPList,SPListCollection>(site.Lists)
Seq.iter (fun (x:SPList) -> printf "%s\n" x.Title) lists
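
Because the adapter is now generic in both the collection and the item type, the same adapter can be reused for the other SharePoint collections. The flip side is the runtime type error leakage mentioned above: nothing stops me from pairing a mismatched item and collection type, and the compiler won't complain. Here is a quick sketch of both cases, reusing the site value from above (the mismatched case is left commented out):

// The same adapter works for any other SPBaseCollection, e.g. the child webs
let webs = SPCollectionAdapter<SPWeb, SPWebCollection>(site.Webs)
Seq.iter (fun (w:SPWeb) -> printf "%s\n" w.Title) webs

// This compiles even though the item type does not match the collection type;
// it only blows up at runtime with an InvalidCastException
// let broken = SPCollectionAdapter<SPWeb, SPListCollection>(site.Lists)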

Monday, May 09, 2011

Porting the Log Analysis Code to Haskell

My coworker approached me the other day and asked what open source log analysis tools I would recommend. I personally do not have much experience with general purpose open source log analysis tools, so I probably would have recommended that he take a look at Splunk. But since I had recently written some customized log analysis software, I became curious and asked him what he intended to do with it.

My coworker said that he needed to analyze Tibco EMS logs. Tibco EMS logs incoming messages in the order it receives them. My coworker was interested in sets of related messages identified by a message ID tag. His particular issue was that the log entries he was interested in were interspersed with other entries that he was not interested in. He wanted a log file where the log entries are grouped by message ID in historical order.

Once I understood his needs, I realized that he did not need Splunk and that I could quickly adapt the F# log analysis software from the previous blog post to his needs. When I gave him the modified F# code, he asked me if I could port it to Linux. That threw me for a loop. I briefly entertained the idea of setting up Mono and compiling F# on it, but decided against that for now. I thought it would be easier to just port the code to Haskell, which I already have on Linux.
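
For what it's worth, the F# adaptation amounted to little more than sorting the parsed log entries by their message ID before writing them back out. The sketch below is roughly what that looks like; it leans on the parselog and logparser functions from the F# post further down, assumes the category extractor has been changed to pull the Tibco message ID tag, and the file names are made up:

open System.IO

// Group parsed entries by message ID; List.sortBy is a stable sort,
// so entries within each ID keep their original (historical) order
let groupByMessageId (infile:string) (outfile:string) =
    parselog infile logparser
    |> List.sortBy (fun x -> x.Category)
    |> List.collect (fun x -> x.Entry)
    |> fun lines -> File.WriteAllLines(outfile, List.toArray lines)

groupByMessageId @"ems.log" @"ems.grouped.log"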

Here's the ported Haskell log analysis software with modifications to work with Tibco log entries.


import Data.Time.Calendar
import Data.Time.LocalTime
import Data.Time.Parse
import Data.List
import System.Environment

type Category = String
type Entry = [String]
type TimeStamp = (LocalTime,String)
type LogHeader = (TimeStamp, Category)

alphaTime = LocalTime (fromGregorian 2000 1 1) midnight 

data LogEntry = LogEntry (TimeStamp, String) [String]
                deriving (Show)

{- Grab label -}
categorize (_ : _ : label : _) = label
categorize words = ""

{- Grab timestamp -}
timestamp (date : time : _ )  = strptime "%Y-%m-%d %H:%M:%S" (date ++ " " ++ time)
timestamp words  = Nothing
           
{- header :: String -> (Maybe (LocalTime, String), String) -}
header line = (timestamp tokens, categorize tokens)    
    where tokens = words line

{- Concrete implementation of Tibco log parser -}    
logparser :: [String] -> LogHeader -> [String] -> [LogEntry] -> [LogEntry]
logparser (line : rest) xheader entry entries = process (header line)
    where process (Just ts, label) =
              logparser rest (ts, label) [line] ((LogEntry xheader (reverse entry)) : entries)
          process (Nothing, _) = logparser rest xheader (line : entry) entries

logparser [] xheader entry entries = reverse ((LogEntry xheader (reverse entry)) : entries)

{- Utility functions to pull fields out of a LogEntry -}
entry (LogEntry _ entries) = entries    
category (LogEntry (_,label) _) = label

{- comparator based on category -}
categorysort (LogEntry (_,a) _) (LogEntry (_,b) _) 
    | a > b = GT
    | a < b = LT
    | otherwise = EQ

parselog parser lines = parser lines ((alphaTime,".000"),"STARTFLAG") [] []
    
processlog = unlines
             . map (unlines . entry)
             . sortBy categorysort
             . (parselog logparser)
             . lines 

main = do (filename:_) <- getArgs
          contents <- readFile filename
          putStr (processlog contents)

Monday, March 21, 2011

Log Analysis with F#

I am often pulled into investigative teams to resolve performance issues. In one particular instance, I was involved in troubleshooting a Java-based batch job that had experienced performance problems after it was migrated from an older database platform to a newer one. Typically in these performance investigations, I analyze the logs to see if I can spot telltale signs of the problem. Some of these log files run to millions of lines of text; in this particular case, I received a log file with almost 4 million lines. I do not want to wade through these log files by hand. In the past, I would have used a scripting language like Perl or Ruby to help parse these logs, but lately I have leaned more towards F# to analyze log files. The following describes my experience with developing my log analysis software in F#.

The log files that I have received have sample entries with the following pattern:

02/04/11 21:15:55 xyz 123 CATEGORY_LABEL Lorem ipsum dolor sit amet 
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
    Aliquam varius mollis purus, vel molestie lorem pellentesque sed. 

where CATEGORY_LABEL categorizes the job performed and each log entry always starts with a timestamp. Typically, multiple jobs are running at the same time, so log entries from different jobs are interspersed. To start this effort, I want to capture the timestamp, the category label, and the entire log entry. To model this, I created the following type to capture a log entry:


open System
open System.IO

type LogEntry =
    {
        Timestamp : DateTime;
        Category  : string;
        Entry     : string list;
    }

I want to be able to easily access the Timestamp and Category information while retaining the entire log entry text for that timestamp/category, so I decided to model the entry body as a list of strings, hoping to leverage list operations later on for filtering. One drawback of this model is that a logical piece of text split across two lines in the log file ends up as two separate strings in the list.
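
For example, once parsed, the sample entry above ends up looking roughly like this as a LogEntry value (the exact DateTime is illustrative and depends on how the timestamp string is parsed):

// Roughly what the sample entry above becomes (illustrative values)
let sample =
    { Timestamp = DateTime(2011, 2, 4, 21, 15, 55)
      Category  = "CATEGORY_LABEL"
      Entry     = [ "02/04/11 21:15:55 xyz 123 CATEGORY_LABEL Lorem ipsum dolor sit amet"
                    "    Lorem ipsum dolor sit amet, consectetur adipiscing elit."
                    "    Aliquam varius mollis purus, vel molestie lorem pellentesque sed." ] }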

The first problem is to extract the timestamp and the category information. To capture the timestamp information, I implemented the following functions:


let parsedate datestring = DateTime.TryParse(datestring)

let getTimestamp (words:string []) =
    if words.Length > 2 then
        let timestamp = sprintf "%s %s" words.[0] words.[1]
        match parsedate timestamp with
            | true, d -> Some(d)
            | false,_ -> None
    else None

I leveraged F#'s Option type to return either a valid timestamp or None if no timestamp is found. You can see later how I use this to determine whether I'm at the start of a new log entry. Here's the code to get the category label:

let getCategory (words:string []) =
    if words.Length > 4 then 
        let tokens = words.[4].Split[|':'|]
        tokens.[0]
    // Once in a while, there is a log entry with no label
    else ""

I know that the category label is always the 5th token, but once in a while it shows up in the log as CATEGORY_LABEL:Lorem, so I have to split the word on ":" to extract just the category label.

With these two functions, I can now build my getHeader function, which extracts the timestamp and category label:

    
// Get the timestamp and the job category     
let getHeader (line:string) =
    let words = line.Trim().Split[| ' ' |]
    let timestamp = getTimestamp words
    let category = getCategory words
    (timestamp,category)
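
A quick sanity check in F# Interactive against the first line of the sample entry (the parsed DateTime is culture-dependent, since DateTime.TryParse uses the current culture):

getHeader "02/04/11 21:15:55 xyz 123 CATEGORY_LABEL Lorem ipsum dolor sit amet"
// roughly: (Some 2/4/2011 9:15:55 PM, "CATEGORY_LABEL") on a US-English machine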

I now have all the pieces to be able to build my log parser:


// Utility function to create log entry
let createLogEntry (timestamp,category)  entry =
    {Timestamp = timestamp;  Category=category;  Entry = List.rev entry}      
    
    
// Core parser engine to parse the log files
let rec logparser lines oldheader entry (acc:LogEntry list) =
    match lines with
    | line :: rest -> 
        // Check whether this line starts with a new timestamp
        let header = getHeader line
        match header with
        | (Some(ts),category) -> 
            // Found a new timestamp: close out the previous entry from the collected lines
            let logentry = createLogEntry oldheader entry
            // Begin collecting a new log entry, starting with this header line
            logparser rest (ts,category) [line] (logentry::acc)
        | (None,_) -> 
            // No timestamp here, so add this line to the current entry
            logparser rest oldheader (line::entry) acc
    // We're done with all line processing!
    | [] -> 
        let logentry = createLogEntry oldheader entry
        logentry :: acc |> List.rev

This log parser recursively processes each line in the log file and builds a list of LogEntry values to return. When I find a header, I begin a new LogEntry. Because I build the list with the :: (cons) operator, I need to reverse it when I finish building the list. I use the parser engine as shown in the following code, which returns the list of all log entries.

let parselog file parser =
    let lines = File.ReadAllLines(file)
    let past = DateTime.Now
    parser (Array.toList lines) (past,"") [] []

Now I have all the pieces to begin performing log analysis. I typically begin by chopping the log into smaller pieces. In the following example, I'm looking for a specific category on specific dates, one before the database migration and one after the database migration.

let (_,premigration) = DateTime.TryParse("03/11/2011")
let (_,postmigration) = DateTime.TryParse("03/14/2011")

// Extraction engine to pull logs related to specific job @ specific date
let extract targetdate label file processor =
    parselog file logparser
    |> List.filter (fun x -> x.Timestamp.Date.Equals(targetdate))
    |> List.filter (fun x -> x.Category.Trim().Equals(label))
    |> processor
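
The snippets below also use a small writefile helper that I haven't shown; a minimal version (given the open System.IO from earlier) could be something like:

// Minimal helper assumed by the snippets below: dump an array of lines to a file
let writefile (file:string) (lines:string []) = File.WriteAllLines(file, lines)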

// Send log entry text to file
let send2file file entries =
    entries 
    |> List.map (fun x -> x.Entry)
    |> List.concat
    |> List.toArray
    |> writefile file

// smaller working file - premigration for JOB A
(send2file @"JOBA.premigration.log") |>
extract premigration.Date "JOBA" @"jobs.log" 

// smaller working file - postmigration for JOB A
(send2file @"JOBA.postmigration.log") |>
extract postmigration.Date "JOBA" @"jobs.log" 

Here's another example where I wanted to grab clusters of log entries along the time dimension (i.e. the unique set of Timestamp, Category pairs).

let getEntryClusters targetdate infile outfile  =
    parselog infile logparser
    |> List.filter (fun x -> x.Timestamp.Date.Equals(targetdate))
    |> List.map (fun entry -> (entry.Timestamp,entry.Category))
    |> Set.ofList |> Set.toList
    |> List.map (fun (ts,category) -> sprintf "%s,%s" (ts.ToString()) category)
    |> List.toArray 
    |> writefile outfile

getEntryClusters postmigration.Date @"jobs.log" @"clusters.log"

The interactive capability of F# in Visual Studio made building this log analysis tool a lot easier than I expected. While I can do similar interactive work with Ruby's irb or Python's interactive mode, Visual Studio and F# just feel much more productive than either of those environments. If I had to analyze another log file with a different format, I could certainly reuse my code and perhaps just change the logparser implementation. Indeed, just a few days later, I was able to apply this code to some Tibco logs with a minimal amount of changes.