Haskell’s an incredible language but I’m repeatedly struck by (what I perceive as) gaps in its standard libraries (though as proven yesterday I could know them better). For example, there aren’t any cross-platform file path manipulation routines in the style of python’s os.path (though not for long, thanks to Neil Mitchell). Another lacking area seems to be string manipulation.
Today, I needed to split a string up according to some token, where that token is itself a string, not a single character. Thus, a split in the style of python’s strings’ built in split() method. That is, I wanted to be able to do this:
*StringSplit> split ", " "one, two, three, four" ["one","two","three","four"] *StringSplit> split "an" "banana" ["b","","a"]
To my amazement, this isn’t a standard library function, and I couldn’t find hundreds of examples of how to do it via google. I couldn’t even find one! I could find a number of variants where the token is just a single Char, usually utilising takeWhile and the like. Gladly this led me to a reasonable (I hope) solution. This example splits on Chars but in the comments I was introduced to unfoldr, which looked tantalisingly close to what I wanted. All I needed to do now was write the function unfoldr calls. About half an hour and much hoogling later, hey pesto:
module StringSplit where import List -- | String split in style of python string.split() split :: String -> String -> [String] split tok splitme = unfoldr (sp1 tok) splitme where sp1 _ "" = Nothing sp1 t s = case find (t `isSuffixOf`) (inits s) of Nothing -> Just (s, "") Just p -> Just (take ((length p) - (length t)) p, drop (length p) s)
All of the complexity is in sp1 (short for “split once”), which performs a single split. sp1‘s type is String -> String -> Maybe (String, String); the section (sp1 tok) then has the right type for unfoldr (sections just seem to be cropping up everywhere lately, eh?). The point of sp1 is to split s at the first occurrence of t. It does this by using inits to build all prefices of s and finding the first one having t as a suffix. If found, it splits the string at the token, throws the token away, and return the (before, after) pair. If the token isn’t found, it just returns s paired with an empty string; that gets passed by unfoldr to sp1 on the next call, causing sp1 to return Nothing, causing unfoldr to finish – whew!
That last bit sounds a bit complicated, but it was necessary to fix a bug present in my first implementation of sp1:
sp1' t s = case find (t `isSuffixOf`) (inits s) of Nothing -> Nothing Just p -> Just (take ((length p) - (length t)) p, drop (length p) s)
This misses the final substring, so split "n" "banana" would return ["ba", "a"] instead of ["ba", "a", "a"]. Hence the whole empty string/Nothing/stop machinery.
I could not have written this code two weeks ago, I think. Thus, I’m sure there are better ways to do this, and would love to hear them – and any criticism. I was a wee bit worried about using inits and find in this way for finding a substring (no standard substring function?). Haskell’s lazy nature ought to help me here though: inits will only build the entire list of prefices if t isn’t in s at all. For anything I’m planning to use this for, it ought to be fine; it’s probably not the best general way to do it though.
For good measure, since I had to implement it later that evening, here’s a string.strip() (aka trim):
strip :: String -> String strip s = dropWhile ws $ reverse $ dropWhile ws $ reverse s where ws = (`elem` [' ', '\n', '\t', '\r'])
I’m particularly proud of the ws part. :-) Look at me, all full of Haskell pride. ;-)
I realised I can generalise split beyond strings, to work on lists of anything of class Eq:
-- | A function inspired by python's string.split(). A list is split -- on a separator which is itself a list (not a single element). split :: Eq a => [a] -> [a] -> [[a]] split tok splitme = unfoldr (sp1 tok) splitme where sp1 _  = Nothing sp1 t s = case find (t `isSuffixOf`) $ (inits s) of Nothing -> Just (s, ) Just p -> Just (take (length p - length t) p, drop (length p) s)
*StringSplit> split [2,3] [1,2,3,4,3,2,1,2,3,4] [,[4,3,2,1],]