> import Data.Char (chr, isLower, ord)
This is a little toy implementation of the Vigenère cipher - written primarily for autodidactic purposes. The code is based on the Caesar cipher code from Graham Hutton's book Programming in Haskell. This post is written in literate Haskell, so you should be able to copy-paste the whole thing into an *.lhs file and run it.
First the encryption and decryption functions:
> {- Vigenere Encode & Decode -}
> -- converts letter (ASCII encoded) to numeric value
> let2int :: Char -> Int
> let2int c = ord c - ord 'a'
>
> -- converts a numeric value to ASCII
> int2let :: Int -> Char
> int2let n = chr (ord 'a' + n)
>
> -- shift letter by n places modulo 26
> shift :: Int -> Char -> Char
> shift n c | isLower c = int2let ((let2int c + n) `mod` 26)
>           | otherwise = c
>
The Caesar and Vigenère ciphers are closely related. In a Caesar cipher, each letter of the alphabet is shifted along some number of places; in a Caesar cipher of shift 3, for example, A becomes D, B becomes E and so on. This is what the shift function is used for. The Vigenère cipher consists of several Caesar ciphers in sequence with different shift values. I used a list comprehension to cycle through the provided key, shifting each letter as required.
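A quick sanity check of shift in GHCi (the session lines are plain text, not part of the compiled source):

ghci> shift 3 'a'
'd'
ghci> map (shift 3) "haskell"
"kdvnhoo"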
> -- vigenere encodes the message
> encode :: String -> String -> [Char]
> encode key msg = [ shift (let2int k) c | (k,c) <- zip (cycle key) msg]
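Encoding a short message in GHCi:

ghci> encode "key" "hello"
"rijvs"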
The decode function is just the inverse of the encode function. On a side note: this would be a nice property to test with QuickCheck - a sketch follows the definition below.
> -- vigenere decodes the message
> decode :: String -> String -> [Char]
> decode key msg = [ shift (26 - let2int k) c | (k,c) <- zip (cycle key) msg]
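Decoding recovers the original message:

ghci> decode "key" "rijvs"
"hello"

And here is a minimal sketch of the QuickCheck round-trip property mentioned above. It is not part of the compiled post: it assumes an additional import Test.QuickCheck, and it restricts the inputs to the lowercase letters the cipher actually handles.

-- round trip: decoding an encoded message yields the original,
-- for non-empty all-lowercase keys and lowercase messages
prop_roundTrip :: String -> String -> Property
prop_roundTrip key msg = not (null key') ==> decode key' (encode key' msg') == msg'
  where key' = filter isLower key
        msg' = filter isLower msg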
This takes care of the cryptographic functions in the characteristically concise Haskell way. But the interesting part is the cryptanalytic side: actually cracking the cipher. To crack Vigenère, you first have to figure out the length of the key used to encrypt the plain text. There are two common methods for this: a statistical one based on the Index of Coincidence (IC), and a more mechanical approach called Kasiski's method. I've only implemented the first approach here, mainly because I'm lazy. Kasiski's method is actually easier to understand and probably more accurate for longer keys than the IC...
Once you know the length of the cipher key, you can separate the underlying Caesar ciphers and perform an individual frequency analysis for each position in the key. The last step is then to match the observed frequencies against a table of expected frequencies. This allows you to recover the key.
> {- Frequency analysis -}
> type FrequencyTable = [Float]
>
> -- frequency table for the english language (expected values)
> table :: FrequencyTable
> table = [8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0,
>          6.1, 7.0, 0.2, 0.8, 4.0, 2.4, 6.7,
>          7.5, 1.9, 0.1, 6.0, 6.3, 9.1, 2.8,
>          1.0, 2.4, 0.2, 2.0, 0.1]
>
> lowers :: String -> Int
> lowers xs = length [x | x <- xs, isLower x]
>
> count :: Char -> String -> Int
> count x xs = length [x' | x' <- xs, x == x']
>
> percent :: Int -> Int -> Float
> percent n m = (fromIntegral n / fromIntegral m) * 100
The freqs function calculates the observed frequencies (in percent) of all the characters in the cipher text. Frequency analysis is a very common attack on many pen-and-paper ciphers.
> freqs :: String -> [Float]
> freqs xs = [percent (count x xs) n | x <- ['a'..'z']]
>   where n = lowers xs
>
>
> indexOfCoincidence :: String -> Float
> indexOfCoincidence xs = fromIntegral (sum [f * (f-1) | f <- counts]) / fromIntegral (n * (n-1))
>   where
>     counts = [count x xs | x <- ['a'..'z']]
>     n = sum counts
The function passwordLength deduces the likely password length for English texts based on the Index of Coincidence; this is sometimes also called the Friedman test. The intuition: English plaintext has an IC of roughly 0.067, uniformly random letters come out around 1/26 ≈ 0.0385, and the more Caesar alphabets a ciphertext mixes together, the closer its IC drifts towards the random value - the thresholds in the guards below are the expected IC values for key lengths one through ten. As I mentioned earlier, the IC suffers from some weaknesses, first and foremost that it gets inaccurate the longer the cipher key is.
> passwordLength :: String -> Int
> passwordLength xs | ic > 0.06552 = 1
>                   | ic > 0.05185 = 2
>                   | ic > 0.04730 = 3
>                   | ic > 0.04502 = 4
>                   | ic > 0.04365 = 5
>                   | ic > 0.04274 = 6
>                   | ic > 0.04209 = 7
>                   | ic > 0.04160 = 8
>                   | ic > 0.04122 = 9
>                   | otherwise    = 10 -- longer keys get capped at ten
>   where
>     ic = indexOfCoincidence xs
Once you know the length of the key, you can calculate the letter frequencies for each cipher key position. To find the best match between the expected frequencies in the FrequencyTable and the observed frequencies in the cipher text, a statistical method called the chi-square test is used.
> -- Chi square for the observed frequencies
> chisqr :: [Float] -> [Float] -> Float
> chisqr os es = sum [((o - e) ^ 2) / e | (o,e) <- zip os es]
>
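A tiny worked example: with observed counts [6, 2] against expected counts [4, 4], chisqr gives (6-4)^2/4 + (2-4)^2/4 = 1 + 1 - the smaller the result, the better the observed distribution fits the expected one:

ghci> chisqr [6, 2] [4, 4]
2.0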
> rotate :: Int -> [a] -> [a]
> rotate n xs = drop n xs ++ take n xs
>
> positions :: Eq a => a -> [a] -> [Int]
> positions x xs = [i | (x',i) <- zip xs [0..], x == x']
>
The bestmatch function cycles through all possible shifts (zero to 25) and returns the most likely number of positions the alphabet was shifted, i.e. the one with the lowest chi-square value.
> bestmatch :: String -> Int
> bestmatch xs = head (positions (minimum chitab) chitab)
>   where
>     chitab = [chisqr (rotate n table') table | n <- [0..25]]
>     table' = freqs xs
>
>
> separateAlphabets :: String -> Int -> [String]
> separateAlphabets xs n = [[ c | (c,p) <- zip xs (cycle [0..n-1]), p `mod` n == i] | i <- [0..n-1]]
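The effect is easiest to see on a toy input; with a key length of 2, every other character lands in the same alphabet:

ghci> separateAlphabets "abcdef" 2
["ace","bdf"]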
>
> findKey :: String -> String
> findKey msg = [ int2let (bestmatch s) | s <- alphas ]
>   where
>     alphas = separateAlphabets msg cnt
>     cnt = passwordLength msg
The crack function combines all of the above into another Haskell one-liner:
> crack :: String -> String
> crack cipherText = decode (findKey cipherText) cipherText
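Putting it all together - a hedged sketch, where plaintext stands for a hypothetical value holding a few paragraphs of lowercase English (the statistics need a reasonably long text; short inputs will defeat them):

ghci> let cipherText = encode "queen" plaintext
ghci> crack cipherText == plaintext
True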
Thursday, October 29, 2009
Monday, October 19, 2009
A Neighborhood of Infinity: You Could Have Invented Monads!
Another gem from 'A Neighborhood of Infinity' - one of my favorite blogs. Definitely the best, most straightforward explanation of the concept of monads in Haskell I have come across so far, illustrating that monads are not the incomprehensible, magical, mystical things they are so often depicted to be. On the contrary, monads are a naturally arising pattern when you're programming in a purely functional setting. Using simple examples, Dan goes on to show that seemingly unrelated problems can all be solved using monads. A truly brilliant article that every Haskell beginner trying to wrap their head around monads should read.
Tuesday, March 17, 2009
Don't fear the Monads
Dr. Brian Beckman, a Channel 9 celebrity, astrophysicist and senior software engineer thought it would be a very good idea to address the complexity of monads in an easy to understand way: a technical conversation at the whiteboard with yours truly for Channel 9.
Intro to FP and Monads in general and Haskell in particular. You'll have to remember high school algebra, tho :) He tries to explain this without going too much into the mathematical details:
Just think a monoid satisfies all the axioms of a group with the exception of having inverses.
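In Haskell terms that intuition is just the Monoid class from Data.Monoid: an associative operation with an identity element, but no inverses. A minimal sketch using the standard list instance:

import Data.Monoid

-- lists form a monoid: (++) is associative and [] is its identity,
-- but there is no operation that "un-appends" a list
example :: [Int]
example = mappend [1, 2] (mappend [3] mempty) -- [1,2,3]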
Thursday, February 19, 2009
John Goerzen on Why You Should Learn Haskell
Where does laziness help real world programs? The traditional explanations I've seen in beginning tutorials all calculate Fibonacci, which is not that interesting.
Goerzen: One of the first things that I thought was interesting was how laziness relates to IO in Haskell. In Haskell there's a function called getContents, and getContents returns a string. That string represents the entire contents of standard input or of a file or whatever, but it doesn't just read the whole thing into memory all at once. It also doesn't mmap. In other languages, if you want a string that represents an entire file, you're either going to mmap the thing or you're going to have to waste a lot of memory, or you read it in blocks or lines or whatever else you might do. If I write a little filter or parser program in Haskell, I don't have to really worry about the details of reading the input line-by-line or block-by-block because I can just use getContents and then I can string together a whole bunch of functions. I like writing Unix filters in Haskell.
That doesn't sound so extraordinary; after all, imperative languages also offer some pretty nice abstractions. The coolest aspect is not how Haskell abstracts the IO, but that lazy IO allows you to completely restructure your programs. You can (and should) separate the code that deals with the IO from the code that works on the contents of a file - any kind of resource, actually. In an imperative setting you tend to have a control structure that loops through the contents of a file chunk by chunk, and in the middle of that a code block that processes each chunk - two blocks that are essentially unrelated. That separation had never even occurred to me until I saw the Haskell way of performing input and output. It's much cleaner and also much more composable: you can simply plug in another processing function if it should become necessary at some point in the future, as long as the types match. And if they don't, the compiler will warn you about it, preventing nasty runtime errors.
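A minimal sketch of such a Unix filter, using nothing beyond the Prelude and Data.Char: interact wires getContents to putStr, so the pure pipeline in the middle can be swapped out at will.

import Data.Char (toUpper)

-- read stdin lazily, upper-case it line by line, write to stdout;
-- replace the pipeline with any other String -> String function
main :: IO ()
main = interact (unlines . map (map toUpper) . lines)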
Thursday, January 1, 2009
FoxTrot
This seems to be the most stumbled FoxTrot strip.

Note however, that he forgot the trailing \n in the printf call. tsk tsk tsk... Also, I think Jason would realize how powerful Haskell is and use it rather than C.
So here's my proposition in Haskell:
sequence_ $ replicate 500 $ putStrLn "I will not throw paper airplanes in class."
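Or, leaning on Control.Monad instead of building a list of actions by hand:

import Control.Monad (replicateM_)

main :: IO ()
main = replicateM_ 500 (putStrLn "I will not throw paper airplanes in class.")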