Python-style string split (and strip / trim) in Haskell

Haskell’s an incredible language but I’m repeatedly struck by (what I perceive as) gaps in its standard libraries (though as proven yesterday I could know them better). For example, there aren’t any cross-platform file path manipulation routines in the style of python’s os.path (though not for long, thanks to Neil Mitchell). Another lacking area seems to be string manipulation.

Today, I needed to split a string up according to some token, where that token is itself a string, not a single character. Thus, a split in the style of python’s strings’ built in split() method. That is, I wanted to be able to do this:

*StringSplit> split ", " "one, two, three, four"
["one","two","three","four"]
*StringSplit> split "an" "banana"
["b","","a"]

To my amazement, this isn’t a standard library function, and I couldn’t find hundreds of examples of how to do it via google. I couldn’t even find one! I could find a number of variants where the token is just a single Char, usually utilising takeWhile and the like. Gladly this led me to a reasonable (I hope) solution. This example splits on Chars but in the comments I was introduced to unfoldr, which looked tantalisingly close to what I wanted. All I needed to do now was write the function unfoldr calls. About half an hour and much hoogling later, hey pesto:

module StringSplit where

import List

-- | String split in style of python string.split()
split :: String -> String -> [String]
split tok splitme = unfoldr (sp1 tok) splitme
    where sp1 _ "" = Nothing
          sp1 t s = case find (t `isSuffixOf`) (inits s) of
                      Nothing -> Just (s, "")
                      Just p -> Just (take ((length p) - (length t)) p,
                                      drop (length p) s)

All of the complexity is in sp1 (short for “split once”), which performs a single split. sp1‘s type is String -> String -> Maybe (String, String); the section (sp1 tok) then has the right type for unfoldr (sections just seem to be cropping up everywhere lately, eh?). The point of sp1 is to split s at the first occurrence of t. It does this by using inits to build all prefices of s and finding the first one having t as a suffix. If found, it splits the string at the token, throws the token away, and return the (before, after) pair. If the token isn’t found, it just returns s paired with an empty string; that gets passed by unfoldr to sp1 on the next call, causing sp1 to return Nothing, causing unfoldr to finish – whew!

That last bit sounds a bit complicated, but it was necessary to fix a bug present in my first implementation of sp1:

sp1' t s = case find (t `isSuffixOf`) (inits s) of
               Nothing -> Nothing
               Just p -> Just (take ((length p) - (length t)) p,
                               drop (length p) s)

This misses the final substring, so split "n" "banana" would return ["ba", "a"] instead of ["ba", "a", "a"]. Hence the whole empty string/Nothing/stop machinery.

I could not have written this code two weeks ago, I think. Thus, I’m sure there are better ways to do this, and would love to hear them – and any criticism. I was a wee bit worried about using inits and find in this way for finding a substring (no standard substring function?). Haskell’s lazy nature ought to help me here though: inits will only build the entire list of prefices if t isn’t in s at all. For anything I’m planning to use this for, it ought to be fine; it’s probably not the best general way to do it though.

Postscript

For good measure, since I had to implement it later that evening, here’s a string.strip() (aka trim):

strip :: String -> String
strip s = dropWhile ws $ reverse $ dropWhile ws $ reverse s
    where ws = (`elem` [' ', '\n', '\t', '\r'])

I’m particularly proud of the ws part. :-) Look at me, all full of Haskell pride. ;-)

Postpostscript

I realised I can generalise split beyond strings, to work on lists of anything of class Eq:

-- | A function inspired by python's string.split().  A list is split
-- on a separator which is itself a list (not a single element).
split :: Eq a => [a] -> [a] -> [[a]]
split tok splitme = unfoldr (sp1 tok) splitme
    where sp1 _ [] = Nothing
          sp1 t s = case find (t `isSuffixOf`) $ (inits s) of
                      Nothing -> Just (s, [])
                      Just p -> Just (take (length p - length t) p,
                                      drop (length p) s)
*StringSplit> split [2,3] [1,2,3,4,3,2,1,2,3,4]
[[1],[4,3,2,1],[4]]

11 Responses to “Python-style string split (and strip / trim) in Haskell”

  1. Matthew
    April 27th, 2007 | 8:28 pm

    I think most people just use Text.Regex.splitRegex.

  2. April 27th, 2007 | 9:03 pm

    Nice — thanks!

    I should definitely put aside some time to get to know the standard libraries better – that much is becoming clear. Still, it was a good exercise, and now at least there’s a good hit if you google for, say, “python string split in haskell”. :-)

    Cheers!

  3. Weave Jester
    April 28th, 2007 | 1:07 am

    For reference, here’s my attempt at the same function. I’m not sure it’s any more elegant, but may be useful if only because it takes a slightly different approach.

    split tok = spl [] where
    spl ws [] = [ws]
    spl ws vs | tok `isPrefixOf` vs = ws : spl [] (drop (length tok) vs)
    spl ws (v:vs) = spl ( ws ++ [v] ) vs

  4. Kurt
    April 30th, 2007 | 4:17 pm

    Small style point:
    Since you don’t change the ‘t’ parameter passed through ‘sp1′, you might as well just use ‘tok’ from the outer function, inside of ‘sp1′. One less thing to keep track of when reading the function later. So you get:

    split tok splitme = unfoldr sp1 splitme
        where sp1 [] = Nothing
              sp1 s  = case find (tok `isSuffixOf`) $ (inits s) of
                          Nothing -> Just (s, [])
                          Just p  -> Just (take (length p - length tok) p,
                                           drop (length p) s)

  5. April 30th, 2007 | 11:02 pm

    Good point, Kurt. I think I originally had sp1 factored out into a standalone function. Thanks!

  6. petekaz
    May 7th, 2007 | 10:46 am

    Mattew,

    John Goerzen has a library called MissingH. It contains all sorts of useful items that are not present in the standard Haskell libraries. One of which is a Data.String library containing what you have implemented (and more):

    http://software.complete.org/missingh/static/doc/Data-String.html

    John has a Python background and you can see the influence just by looking at this library.

    Pete

  7. kmb
    May 21st, 2007 | 3:18 pm

    I had the exact same need for a string splitting command a while back, except I wanted to use the command from a shell script. I decided to write my solution in Haskell naturally. My solution is here: http://hpaste.org/1929?lines=true, and the sources are here: (http://kmbphoto.net/strsplit.lhs, http://kmbphoto.net/strjoin.lhs) with pdfs from the literate haskell here: (http://kmbphoto.net/strsplit.pdf, http://kmbphoto.net/strjoin.pdf).

    Here are the main functions:

    
    breakString :: String -> String -> (String, String)
    breakString [] str = ([], str)
    breakString _ [] = ([],[])
    breakString delim str@(x:xs)
        | isPrefixOf delim str = ([],str') -- end recursion
        | otherwise = (x:ys,zs) -- recurse, collecting the discarded chars
        where (ys,zs) = breakString delim xs
              str' = drop (length delim) str
    
    splitString :: String -> String -> [String]
    splitString _ [] = []
    splitString delim str = ys : splitString delim zs
        where (ys,zs) = breakString delim str
    

  8. July 3rd, 2007 | 10:48 am

    Check it out: Bryan O’Sullivan shows me how it’s done.

  9. June 20th, 2008 | 3:58 pm

    Just googled ‘haskell split’ because i needed that functionality, read your post, then came up with my own solution:

    Prelude> foldl (\a b -> if b==’n’ then (a ++ [[]]) else (init a) ++ [(last a) ++ [b]]) [[]] “banana”
    ["ba","a","a"]

    Well, not as fancy, but alot smaller.

  10. mat
    January 8th, 2011 | 9:25 am

    Good posting, but I think using Data.List.intercalate is better.

    http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-List.html#v:intercalate

  11. mat
    January 8th, 2011 | 9:31 am

    Oops, I’m confused this with string joining… sorry.