I've been fiddling around with the new FreeBSD 8 recently. The surface of the OS hasn't changed much over time, but 8.0 introduced a couple of major changes under the hood. Anyway, part of the installation process involved a lot of compiling (and recompiling, because I neglected to consult UPDATING, silly me).
I manage my ports mainly through portupgrade. One thing that is a bit annoying about it is that you don't know how many ports are affected by portupgrade -a.
To solve this, I've added another little gem to my toolbox. The tool checks the progress of either portinstall or portupdate.
The Idea
After a little googling, I found that the tried and trusted UNIX ps displays all the information I need. I'm sure you've used it a million times, blissfully ignorant of its full capabilities; I know I have. Rather than the typical ps aux, I will use ps -ao 'lstart etime command'. This spits out the date and time a process started, how long it has been running (mnemonic: etime = elapsed time), and the command, which contains the current port and how many ports will be installed or updated. So basically the idea is: call ps, parse the output, find the portinstall/portupdate process, and format the output a bit more nicely. That's it.
The Tool
So I'm gonna write such a thing in Haskell. If you're absolutely against using Haskell, I also hacked up an equivalent Perl version, available here.
> import System (getArgs)
> import System.Console.GetOpt
> import System.Process (createProcess, proc, std_out, StdStream(..))
> import System.Time (CalendarTime(..), Day(..), Month(..))
> import GHC.IO (hGetContents)
> import Text.Printf (printf)
>
> import Control.Applicative
> import Control.Monad (MonadPlus(..), ap)
> -- Hide a few names that are provided by Applicative.
> import Text.ParserCombinators.Parsec hiding (many, optional, (<|>))
> -- The Applicative instance for every Monad looks like this.
> instance Applicative (GenParser s a) where
>     pure  = return
>     (<*>) = ap
>
> -- The Alternative instance for every MonadPlus looks like this.
> instance Alternative (GenParser s a) where
>     empty = mzero
>     (<|>) = mplus
Calling System Processes
First of all, we'll have to call ps. I will use the System.Process module. Creating a subprocess to make a call to ps is simple.
For example, to execute a simple ls command:
r <- createProcess (proc "ls" [])
To create a pipe from which to read the output of ls:
(_, Just hout, _, _) <-
    createProcess (proc "ls" []){ std_out = CreatePipe }
Note: the program will blow up in your face if createProcess fails to create a handle to stdout for some reason. In that unfortunate case the pattern won't match ('fail' will be called and you certainly don't wanna go there).
Note further: std_out = CreatePipe redirects stdout to the handle called 'hout'. If you don't do that, all output will be written to the console.
The callPsWith function is a simple wrapper around that call; it lets you invoke ps with the specified formatting options.
> callPsWith :: String -> IO String
> callPsWith formattingOptions = do
>     (_, Just hOut, _, _) <- createProcess (proc "ps" ["-ao " ++ formattingOptions]) { std_out = CreatePipe }
>     hGetContents hOut
Technically only one call to ps would be necessary (formatted as 'lstart etime command'); it would contain all the information I'm interested in. However, I will make two calls, because ps won't break the line if the formatted output is too long: the command string usually gets truncated because the terminal is not wide enough.
So to avoid this, I make one call with the arguments '-ao pid command' to find the running portinstall process and its PID, and a second call with '-ao pid lstart etime', in which I look for the lstart and etime of the matching PID.
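To make this concrete, here is roughly what the relevant lines of the two calls look like (the PID and times are invented for illustration; the formats are the ones parsed further down):
  512 ruby18: portinstall: [2/3] multimedia/win32-codecs (ruby18)
  512 Mon Dec 20 17:40:06 2009 01:13:13
The first line comes from 'pid command', the second from 'pid lstart etime'; matching them up by PID gives the full picture.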
Parsing
Right after the imports above, I added instance declarations to make Parsec's GenParser an instance of Applicative, because I'm gonna use quite a bit of applicative Parsec. I used to have real problems wrapping my head around applicative functors. If you have the same trouble, I highly suggest reading the chapter about Applicative Functors on LearnYouAHaskell.com.
Eventually the goal is to parse the output of ps. This being Haskell (rather than Perl), I want to parse the output into handy datatypes that I can use in other parts of the program. So I defined suitable data types to hold the output of ps:
> data ElapsedTime = ElapsedTime {
>       hours   :: Int,
>       minutes :: Int,
>       seconds :: Int
>   } deriving (Show, Eq, Ord)
>
> data Command = Command {
>       portNum   :: Int,
>       portCount :: Int,
>       portName  :: String
>   } deriving (Show, Eq, Ord)
You may wonder where the datatype to store the starting time went. I'm gonna use CalendarTime from the System.Time module. This is a bit overkill for such a small program, but it'll demonstrate how to parse a slightly more complex type.
But first, here are two helper parsers that I will use heavily as parts of other parsers. The first is skipSpaces, and it does just that. The other one, p_Int, reads a sequence of digits and converts it to an Int. Note that Int will overflow if the sequence is too long, e.g. read "4000000000" :: Int returns -294967296 on my machine. However, I know that I will only use it to parse PIDs, dates and times, so there is no danger of an overflow.
> skipSpaces :: CharParser () String
> skipSpaces = many (oneOf " ")
Also note that the type system automatically resolves the call to read to the type declared in the function signature. You've just got to love the type system!
> p_Int :: CharParser () Int
> p_Int = read <$> many1 (oneOf ['0'..'9'])
>
> p_month :: CharParser () Month
> p_month = January <$ try (string "Jan")
>           <|> February <$ string "Feb"
>           <|> March <$ try (string "Mar")
>           <|> April <$ try (string "Apr")
>           <|> May <$ string "May"
>           <|> June <$ try (string "Jun")
>           <|> July <$ string "Jul"
>           <|> August <$ string "Aug"
>           <|> September <$ string "Sep"
>           <|> October <$ string "Oct"
>           <|> November <$ string "Nov"
>           <|> December <$ string "Dec"
>           <?> "Failed to parse month"
>
> p_day :: CharParser () Day
> p_day = Monday <$ string "Mon"
>         <|> Tuesday <$ try (string "Tue")
>         <|> Wednesday <$ string "Wed"
>         <|> Thursday <$ string "Thu"
>         <|> Friday <$ string "Fri"
>         <|> Saturday <$ try (string "Sat")
>         <|> Sunday <$ string "Sun"
>         <?> "Failed to parse day"
These are all the building blocks required to parse a CalendarTime from the System.Time module. The thing to keep in mind is that each line with a '<$' has to be read from right to left. E.g. in the first line of p_month, January <$ try (string "Jan"), the data constructor January is returned if the parsed input contains "Jan". Also note the 'try' function: it makes Parsec backtrack. Parsec attempts to apply the parser, and if it fails, the input is restored to what it was before the attempt. This is important if you have several alternatives that start identically.
Let's say I omitted 'try' and attempted to parse the string "Jun". Parsec would see the letter "J" and try the first alternative that matches, which is string "Jan" in this case. So it would consume the "J" and continue. It then expects to see an "a", but the remaining input is "un" (remember, the "J" has already been consumed). At that point it fails with an error along the lines of "expected 'a' but found 'u'". So even though "Jun" is a valid option, Parsec couldn't parse it without 'try'. With 'try', on the other hand, Parsec restores the input to "Jun" when string "Jan" fails and then continues with the next alternative.
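You can see the difference directly in GHCi (the "(demo)" string is just a source name used in error messages):
parse (string "Jan" <|> string "Jun") "(demo)" "Jun"        -- fails: string "Jan" already consumed the 'J'
parse (try (string "Jan") <|> string "Jun") "(demo)" "Jun"  -- succeeds with Right "Jun"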
I suggest reading the chapter about Parsec on RealWorldHaskell, it explains these and more Parsec functions in detail.
By the way, you can run any of these parsers directly in GHCi if you like to experiment:
parse p_month "(error)" "Jan and some random string you want to parse"
We can use the parsers we've seen so far as building blocks to construct a bigger, more powerful parser that can parse ps's date string, which looks like this: "Mon Dec 20 17:40:06 2009". This time, you have to read each line from left to right. For example, (p_day <* skipSpaces) parses the day and ignores any whitespace; the parsed day then becomes the first parameter of toCTime. I defined toCTime as a wrapper around the CalendarTime constructor for two reasons. Firstly, not all parameters are required. Secondly, and more importantly, by reordering the parameters of the toCTime wrapper I can use the applicative functions (<$>) and (<*>) instead of the do-notation.
> p_time :: CharParser () CalendarTime
> p_time = toCTime <$> (p_day <* skipSpaces)
>          <*> (p_month <* skipSpaces)
>          <*> (p_Int <* skipSpaces)
>          <*> (p_Int <* char ':')
>          <*> (p_Int <* char ':')
>          <*> (p_Int <* skipSpaces)
>          <*> (p_Int <* skipSpaces)
>   where toCTime d m mday th tm ts y = CalendarTime y m mday th tm ts 0 d 0 "UTC" 0 False
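Running it in GHCi on the date string from above gives roughly the following (a CalendarTime record whose remaining fields are filled in with toCTime's defaults):
parse p_time "(demo)" "Mon Dec 20 17:40:06 2009"
-- Right (CalendarTime {ctYear = 2009, ctMonth = December, ctDay = 20, ctHour = 17, ctMin = 40, ctSec = 6, ...})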
The elapsed time looks like this: "01:13:13". etime is a bit tricky because it has a variable length. For example, if the process has been running for 14 seconds, etime is displayed as just "14"; if it has been running for five minutes, it is displayed as "05:00".
> p_elapsed_time :: CharParser () ElapsedTime
> p_elapsed_time = toElapsedTime <$> sepBy p_Int (char ':')
>   where
>     toElapsedTime (h:m:s:_) = ElapsedTime h m s
>     toElapsedTime [m,s] = ElapsedTime 0 m s
>     toElapsedTime [s] = ElapsedTime 0 0 s
>     toElapsedTime [] = ElapsedTime 0 0 0
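A few quick checks in GHCi show how the variable length is handled:
parse p_elapsed_time "(demo)" "01:13:13"  -- Right (ElapsedTime {hours = 1, minutes = 13, seconds = 13})
parse p_elapsed_time "(demo)" "05:00"     -- Right (ElapsedTime {hours = 0, minutes = 5, seconds = 0})
parse p_elapsed_time "(demo)" "14"        -- Right (ElapsedTime {hours = 0, minutes = 0, seconds = 14})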
A command will look like this: "ruby18: portinstall: [2/3] multimedia/win32-codecs (ruby18)". You can see that there are three ports in total that will be installed, and that the win32-codecs video codecs are being installed at the moment. The Ruby bits are there because portinstall is written in Ruby; they show up for every port, so we can ignore them.
The command parser looks like this:
> p_command :: String -> CharParser () Command
> p_command progName = Command <$> (string "ruby" *> p_Int *> string ": " *> string progName *> string ": [" *> p_Int)
>                              <*> (char '/' *> p_Int <* string "] ")
>                              <*> (many (noneOf " "))
>
> p_pid :: String -> CharParser () (Int, Command)
> p_pid progName = (,) <$> (skipSpaces *> p_Int) <*> (skipSpaces *> p_command progName)
>
> p_times :: Int -> CharParser () (CalendarTime, ElapsedTime)
> p_times pid = (,) <$> (skipSpaces *> (string (show pid)) *> skipSpaces *> p_time) <*> (skipSpaces *> p_elapsed_time)
>
>
> tryParse :: CharParser () a -> [String] -> Maybe a
> tryParse _ [] = Nothing
> tryParse p_line (l:ls) =
>     case parse p_line "Error While Parsing" l of
>       Left _ -> tryParse p_line ls
>       Right match -> Just match
>
> lookupPidFor :: String -> [String] -> Maybe (Int, Command)
> lookupPidFor programName = tryParse (p_pid programName)
>
> lookupStatusOf :: [String] -> (Int, Command) -> Maybe (CalendarTime, ElapsedTime, Command)
> lookupStatusOf linesToParse (pid, command) = tryParse (p_times pid) linesToParse >>= \(ct, et) ->
>     Just (ct, et, command)
>
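To see how these pieces fit together, here's a quick GHCi check on a single made-up ps line (the PID 512 is invented; the command column follows the format described above):
lookupPidFor "portinstall" ["  512 ruby18: portinstall: [2/3] multimedia/win32-codecs (ruby18)"]
-- => Just (512, Command {portNum = 2, portCount = 3, portName = "multimedia/win32-codecs"})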
Formatting
I use printf from the Text.Printf module to format the final output. It behaves much like its C counterpart.
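For example, the zero-padded time fields below rely on the "%02d" specifier:
printf "%02d:%02d:%02d\n" (1::Int) (5::Int) (9::Int)   -- prints 01:05:09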
> formatStatus :: String -> CalendarTime -> ElapsedTime -> Command -> String
> formatStatus progName start elapsed command =
>     printf "%s status:\n\tstarted on %s %s %d at %02d:%02d:%02d (elapsed: %02d:%02d:%02d)\n\tworking on %s (%d of %d)"
>            progName wday month day sHour sMin sSec eHour eMin eSec port pNum pCount
>   where wday = (show . ctWDay) start
>         month = (show . ctMonth) start
>         day = ctDay start
>         sHour = ctHour start
>         sMin = ctMin start
>         sSec = ctSec start
>         eHour = hours elapsed
>         eMin = minutes elapsed
>         eSec = seconds elapsed
>         port = portName command
>         pNum = portNum command
>         pCount = portCount command
Input Handling
Handling user input through the GetOpt module is easy, although I barely scratched the surface. Two features that would be handy even in this simple app are the declaration of required and mutually exclusive options. I want the user to pick either '-i' or '-u', but I couldn't enforce that through GetOpt: it fails if you don't pick any option, and if you pick more than one, it silently goes with the first one.
> data Opts = Opts { program :: String } deriving Show
>
> options :: [OptDescr Opts]
> options = [Option ['u'] ["portupdate"] (NoArg (Opts {program = "portupgrade"})) "show status of portupdate",
>            Option ['i'] ["portinstall"] (NoArg (Opts {program = "portinstall"})) "show status of portinstall"]
>
> parseOpts :: [String] -> IO ([Opts])
> parseOpts args =
>     case getOpt RequireOrder options args of
>       ([], _, []) -> fail ("No option selected " ++ usageInfo header options)
>       (opts, _, []) -> return opts
>       (_, _, errs) -> fail (concat errs ++ usageInfo header options)
>   where
>     header = "Usage: portstatus [OPTION...]"
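A quick sanity check in GHCi:
parseOpts ["-i"]   -- [Opts {program = "portinstall"}]
parseOpts []       -- throws a user error containing the usage message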
Putting it all together
> main :: IO ()
> main = do
>     optList <- parseOpts =<< getArgs
>     let programName = program $ head optList
>
>     psPids <- callPsWith "pid command"
>     psTimes <- callPsWith "pid lstart etime"
>
>     case (lookupPidFor programName (lines psPids)) >>= (lookupStatusOf (lines psTimes)) of
>       Nothing -> putStrLn $ "Couldn't find a running instance of " ++ programName
>       Just (lstart, etime, command) -> putStrLn $ formatStatus programName lstart etime command
Friday, December 25, 2009
Thursday, November 26, 2009
Early Christmas gifts this year - New FreeBSD & Haskell versions
A lot of interesting news this week:
In the Haskell camp, Haskell 2010 was officially announced by Simon Marlow! This is the first update to the language specification since Haskell 98, over a decade ago. The language committee has been working on it for the last four or so years. What's more, from now on there will be an annual release schedule to keep the language up to date.
Meanwhile, the FreeBSD folks released version 8.0 early this week. It hasn't actually been announced yet, but it is already available in the repositories, and the distributions have been mirrored internationally.
Coincidentally, I've recently started working on a port of the cabal-install command to FreeBSD, so that I can leverage the full power of my favorite programming language on my favorite operating system.
Tuesday, November 3, 2009
Multicore Haskell Now!
I just looked through the slides of Don's latest talk about concurrent and parallel programming in Haskell.
Shamelessly lifted from the Abstract:
Multicore computers are here: is your programming language ready for it? Haskell is: you can take an off-the-shelf copy of GHC and write high performance parallel programs right now.
If you want to program a parallel machine, a purely functional language such as Haskell is a good choice: purity ensures the language is by-default safe for parallel execution, (whilst traditional imperative languages are by-default unsafe). This foundation has enabled Haskell to become something of a melting pot for high level approaches to concurrent and parallel programming, all available with an industrial strength compiler and language toolchain, available now for mainstream multicore programming.
I wish someone would record these talks and put them on YouTube! The talk outlines the various concurrency and parallelism strategies available in Haskell.
More details about concurrency can be found on the Haskellwiki page.
There is also a whole chapter in the fantastic Real World Haskell devoted to concurrent and multicore programming, featuring explicit threads using forkIO and synchronized shared variables in the form of MVars. Moreover, there are examples of implicit parallelism (so-called sparks) showing the application of the par and pseq combinators.
Oh, by the way, there's also a complete chapter about Software Transactional Memory, which is really intriguing. The approaches above rely on explicit locking of shared variables: MVars behave similarly to mutexes and semaphores in procedural programming, and they suffer from the exact same problems: deadlocks, race conditions, uncaught exceptions, etc. STM provides a way to perform database-like (i.e. ACID) transactions on shared variables, including retries and rollbacks, so you never have to worry about your shared data again. STM is not unique to Haskell, but for some reason it hasn't caught on in the mainstream yet.
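For a taste of what the spark style looks like, here's a small sketch of my own (not taken from the book or the slides; it assumes the parallel package providing Control.Parallel):
import Control.Parallel (par, pseq)

-- spark the evaluation of `a` in parallel while the main thread forces `b`,
-- then combine the two results
parSum :: [Int] -> [Int] -> Int
parSum xs ys = a `par` (b `pseq` (a + b))
  where
    a = sum xs
    b = sum ys
Compiled with -threaded and run with +RTS -N, the two sums can be evaluated on different cores.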
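As a rough illustration of the STM style (again my own sketch, assuming the stm package; the account names are made up):
import Control.Concurrent.STM

-- move `amount` from one account to the other in a single atomic transaction;
-- run it with: atomically (transfer accountA accountB 10)
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  writeTVar from (balance - amount)
  target <- readTVar to
  writeTVar to (target + amount)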
These two chapters cover pretty much everything mentioned in the slides. So you should get a clear idea about concurrency in Haskell.
Thursday, October 29, 2009
Vigenère Cipher in Haskell for fun
> import Char
This is a little toy implementation of the Vigenère cipher, written primarily for autodidactic purposes. The code is based on the Caesar cipher code in Graham Hutton's book Programming in Haskell. This post is written in literate Haskell, so you should be able to copy-paste the whole thing into an *.lhs file and run it.
First the encryption and decryption functions:
> {- Vigenere Encode & Decode -}
> -- converts letter (ASCII encoded) to numeric value
> let2int :: Char -> Int
> let2int c = ord c - ord 'a'
>
> -- converts a numeric value to ASCII
> int2let :: Int -> Char
> int2let n = chr (ord 'a' + n)
>
> -- shift letter by n places modulo 26
> shift :: Int -> Char -> Char
> shift n c | isLower c = int2let ((let2int c + n) `mod` 26)
>           | otherwise = c
>
The Caesar and Vigenère ciphers are closely related. In a Caesar cipher, each letter of the alphabet is shifted along some number of places; for example, with a shift of 3, A becomes D, B becomes E, and so on. This is what the shift function is used for. The Vigenère cipher consists of several Caesar ciphers in sequence with different shift values. I used a list comprehension to cycle through the provided key, shifting each letter as required.
> -- vigenere encodes the message
> encode :: String -> String -> [Char]
> encode key msg = [ shift (let2int k) c | (k,c) <- zip (cycle key) msg]
The decode function is just the inverse of the encode function. On a side note: this would be a nice property to test with QuickCheck (a sketch follows the definition below).
> -- vigenere decodes the message
> decode :: String -> String -> [Char]
> decode key msg = [ shift (26 - let2int k) c | (k,c) <- zip (cycle key) msg]
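A quick sanity check in GHCi (the key "ab" and the message are arbitrary; the expected results are worked out by hand):
encode "ab" "hello"   -- "hflmo"
decode "ab" "hflmo"   -- "hello"
And here is a rough sketch of the QuickCheck property mentioned above. It is not part of the literate program; it assumes the QuickCheck package, requires a non-empty key (cycle "" is undefined), and restricts the message to ASCII, since shift only round-trips ASCII lowercase letters:
import Data.Char (isAscii)
import Test.QuickCheck

-- decoding an encoded message with the same non-empty key gives back the original
prop_roundTrip :: String -> String -> Property
prop_roundTrip key msg =
  not (null key) ==> decode key (encode key msg') == msg'
  where msg' = filter isAscii msg

-- run with: quickCheck prop_roundTrip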
This takes care of the cryptographic functions in the characteristically concise Haskelly way. But the interesting part is the cryptanalytic part: cracking the cipher. To crack Vigenère, you first have to figure out the length of the key used to encrypt the plain text. There are two methods to do this: either a statistical method called the Index of Coincidence (IC), or a more mechanical approach called Kasiski's method. I've only implemented the first approach here, mainly because I'm lazy. Kasiski's method is actually easier to understand and probably more accurate for longer keys than IC...
Once you know the key length, you can separate the underlying Caesar ciphers and perform individual frequency analysis, one for each position in the key. The last step is then to match the observed frequencies against a table of expected frequencies, which allows you to calculate the key.
> {- Frequency analysis -}
> type FrequencyTable = [Float]
>
> -- frequency table for the english language (expected values)
> table :: FrequencyTable
> table = [8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0,
>          6.1, 7.0, 0.2, 0.8, 4.0, 2.4, 6.7,
>          7.5, 1.9, 0.1, 6.0, 6.3, 9.1, 2.8,
>          1.0, 2.4, 0.2, 2.0, 0.1]
>
> lowers :: String -> Int
> lowers xs = length [x | x <- xs, isLower x]
>
> count :: Char -> String -> Int
> count x xs = length [x' | x' <- xs, x == x']
>
> percent :: Int -> Int -> Float
> percent n m = (fromIntegral n / fromIntegral m) * 100
The freqs function calculates the observed frequencies (in percent) of all the characters in the cipher text. Frequency analysis is a very common attack on many pen-and-paper ciphers.
> freqs :: String -> [Float]
> freqs xs = [percent (count x xs) n | x <- ['a'..'z']]
>   where n = lowers xs
>
>
> indexOfCoincidence :: String -> Float
> indexOfCoincidence xs = fromIntegral (sum [f * (f-1) | f <- counts]) / fromIntegral (n * (n-1))
>   where
>     counts = [(count x xs) | x <- ['a'..'z']]
>     n = sum counts
The function passwordLength deduces the likely password length for English texts based on the Index of Coincidence. This is sometimes also called the Friedman test. There's a lot of statistical magic going on behind the scenes, and the details are beyond the scope of this text. As I mentioned earlier, IC suffers from some weaknesses; first and foremost, it gets inaccurate the longer the cipher key is.
> passwordLength :: String -> Int
> passwordLength xs | ic > 0.06552 = 0
>                   | ic > 0.05185 = 1
>                   | ic > 0.04730 = 2
>                   | ic > 0.04502 = 3
>                   | ic > 0.04365 = 4
>                   | ic > 0.04274 = 5
>                   | ic > 0.04209 = 6
>                   | ic > 0.04160 = 7
>                   | ic > 0.04122 = 8
>                   | ic > 0.04092 = 9
>                   | otherwise = 10
>   where
>     ic = indexOfCoincidence xs
Once you know the length of the key, you can proceed to calculate the frequencies for each cipher key character. To find the best match between the expected frequencies in the FrequencyTable and the observed frequencies in the cipher text, a statistical method called the chi-squared test is used.
> -- Chi square for the observed frequencies
> chisqr :: [Float] -> [Float] -> Float
> chisqr os es = sum [((o - e) ^ 2) / e | (o,e) <- zip os es]
>
> rotate :: Int -> [a] -> [a]
> rotate n xs = drop n xs ++ take n xs
>
> positions :: Eq a => a -> [a] -> [Int]
> positions x xs = [i | (x',i) <- zip xs [0..], x == x']
>
The bestmatch function cycles through all possible shifts (zero to 25) and returns the most likely number of positions the alphabet was shifted, i.e. the one with the lowest chi-squared value.
> bestmatch :: String -> Int
> bestmatch xs = head (positions (minimum chitab) chitab)
>   where
>     chitab = [ chisqr (rotate n table') table | n <- [0..25]]
>     table' = freqs xs
>
>
> separateAlphabets :: String -> Int -> [String]
> separateAlphabets xs n = [[ c | (c,p) <- zip xs (cycle [0..n-1]), p `mod` n == i] | i <- [0..n-1]]
>
> findKey :: String -> String
> findKey msg = [ int2let (bestmatch s) | s <- alphas ]
>   where
>     alphas = separateAlphabets msg cnt
>     cnt = passwordLength msg
The crack function combines all of the above into another Haskell one-liner:
> crack :: String -> String
> crack cipherText = decode (findKey cipherText) cipherText
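And a hedged usage sketch: longEnglishText below is a placeholder for a reasonably long, lower-case English plain text, since the statistical attack is unreliable on short inputs.
crack (encode "vigenere" longEnglishText)   -- should recover longEnglishText, given enough text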
Monday, October 19, 2009
A Neighborhood of Infinity: You Could Have Invented Monads!
Another gem from 'A Neighborhood of Infinity', one of my favorite blogs. It is definitely the best, most straightforward explanation of the concept of monads in Haskell I have come across so far, illustrating that monads are not the incomprehensible, magical, mystical things they are so often depicted to be. On the contrary, monads are a naturally arising pattern when you're programming in a purely functional setting. Using simple examples, Dan goes on to show that seemingly unrelated problems can all be solved using monads. A truly brilliant article that every Haskell beginner trying to wrap their head around monads should read.
Tuesday, March 17, 2009
Don't fear the Monads
Dr. Brian Beckman, a Channel 9 celebrity, astrophysicist and senior software engineer thought it would be a very good idea to address the complexity of monads in an easy to understand way: a technical conversation at the whiteboard with yours truly for Channel 9.
An intro to FP and monads in general, and Haskell in particular. You'll have to remember high school algebra, tho :) He tries to explain this without going too much into the mathematical details:
Just think: a monoid satisfies all the axioms of a group, with the exception of having inverses.
Thursday, February 19, 2009
John Goerzen on Why You Should Learn Haskell
Where does laziness help real-world programs? The traditional explanations I've seen in beginner tutorials calculate Fibonacci numbers, which is not that interesting.
Goerzen: One of the first things that I thought was interesting was how laziness relates to IO in Haskell. In Haskell there's a function called getContents, and getContents returns a string. That string represents the entire contents of standard input or of a file or whatever, but it doesn't just read the whole thing into memory all at once. It also doesn't mmap. In other languages, if you want a string that represents an entire file, you're either going to mmap the thing or you're going to have to waste a lot of memory, or you read it in blocks or lines or whatever else you might do. If I write a little filter or parser program in Haskell, I don't have to really worry about the details of reading the input line-by-line or block-by-block because I can just use getContents and then I can string together a whole bunch of functions. I like writing Unix filters in Haskell.
That doesn't sound so extraordinary; after all, imperative languages also offer some pretty nice abstractions. The coolest aspect is not how Haskell abstracts the IO, but that lazy IO allows you to completely restructure your programs. You can (and should) separate the code that deals with IO from the code that works on the contents of a file - or any kind of resource, actually. In an imperative setting you tend to have a control structure that loops through the contents of a file chunk by chunk, and in the middle of that loop a block that processes each chunk; these two blocks are essentially unrelated. It had never even occurred to me until I saw the Haskell way of performing input and output. It's much cleaner and also much more composable: you can simply plug in another processing function if that becomes necessary at some point in the future, as long as the types match. And if they don't, the compiler will warn you about it, preventing nasty runtime errors.
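As a tiny example of that separation (my own sketch, not from the interview), here is a complete Unix filter where all the IO is confined to interact, which lazily reads standard input via getContents and hands it to a pure function:
import Data.Char (toUpper)

-- upper-case everything that arrives on stdin and write it to stdout;
-- the transformation is a pure function, the lazy IO lives in interact
main :: IO ()
main = interact (map toUpper)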
Thursday, January 1, 2009
FoxTrot
This seems to be the most stumbled FoxTrot strip.
Note, however, that he forgot the trailing \n in the printf call. Tsk tsk tsk... Also, I think Jason would realize how powerful Haskell is and use it rather than C.
So here's my proposition in Haskell:
sequence_ $ replicate 500 $ putStrLn "I will not throw paper airplanes in class."