Monday, May 21, 2007

Checking duplicate lines in functional programming

Recently at work, I found the need to remove duplicate lines from a file and sort the result. I've must have rewritten this hundreds of times and in a variety of programming languages: java,c#,perl,ruby,vb.

I know that you can do this easily if you have unix shell and the unix tools. Then this can be done with a simple

sort dups.txt | uniq

However, I want to code and my latest fancy is with functional programming language. So my latest iteration of the code is done in Haskell. Haskell took me a while to grok. It really warped my way of thinking about how to program. Reminds me of the first time that I had to develop proofs in math in high school. I'm slowly getting the hang of programming simple stuffs in Haskell.

Code written in Haskell

import List

main = readFile "dups.txt" >>=
(\fileContents -> putStr (unlines (sort (nub (lines fileContents)))))

breaks the file content into lines
removes the duplicate
combines the lines into one giant string

Here's how I would have implemented it in other languages:

Code written in Perl

open(FIN,"< dups.txt");
chomp (@lines);
my %uniques;
foreach $line (@lines)
foreach $key (sort keys %uniques) {
    print $key,"\n";

Code written in Ruby"dups.txt") { |file|
    lines = file.readlines
    lines.sort.each { |line|
        puts line

Code written in F#

Okay, for self enlightment, I tried to implement this in F#,which is mostly OCaml with interop to the .Net Framework. Maybe it's me but OCaml's mixed imperative and functional style confuses me or maybe I just don't have a really good grasp of OCaml syntax. In any case, it took me a while to figure out how to do this in OCaml.
open System.IO;;

let main() =
    let lines = File.ReadAllLines("dups.txt")
    let s = List.fold_right Set.add (Array.to_list lines) Set.empty
    Set.iter print_endline s

do main()

Modified on 11/23/2007...after working in F# for a while, I can rewrite the above F# code as

open System.IO;;

|> Set.of_array
|> Set.iter print_endline

As a developer, I love the code density of Haskell. It's also readable so you are not sacrificing code readability for code density.