replace-megaparsec is for finding text patterns, and also editing and replacing the found patterns. This activity is traditionally done with regular expressions, but replace-megaparsec uses Megaparsec parsers instead for the pattern matching.
Why would we want to do pattern matching and substitution with parsers instead of regular expressions?
Regular expressions can do “group capture” on sections of the matched pattern, but they can only return stringy lists of the capture groups. Parsers can construct typed data structures based on the capture groups, guaranteeing no disagreement between the pattern rules and the rules that we’re using to build data structures based on the pattern matches.
For example, consider scanning a string for numbers. A lot of different things can look like a number, and can have leading plus or minus signs, or be in scientific notation, or have commas, or whatever. If we try to parse all of the numbers out of a string using regular expressions, then we have to make sure that the regular expression and the string-to-number conversion function agree about exactly what is and what isn’t a numeric string. We can get into an awkward situation in which the regular expression says it has found a numeric string but the string-to-number conversion function fails. A typed parser will perform both the pattern match and the conversion, so it will never be in that situation.
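As a minimal sketch of the point above, the parser below both recognizes and converts signed integers, so there is no separate conversion step that could disagree with the pattern. The names `Parser`, `signedInt` and `allInts` are hypothetical and chosen for illustration; `sepCap` comes from this library, and `parseMaybe`, `signed` and `decimal` come from Megaparsec.

```haskell
import Data.Either (rights)
import Data.Void (Void)
import Replace.Megaparsec (sepCap)
import Text.Megaparsec (Parsec, parseMaybe)
import Text.Megaparsec.Char.Lexer (decimal, signed)

-- Hypothetical type alias, for illustration only.
type Parser = Parsec Void String

-- An integer with an optional leading '+' or '-'.
-- Passing (pure ()) means no whitespace is allowed after the sign.
signedInt :: Parser Integer
signedInt = signed (pure ()) decimal

-- Every integer found in the input, already converted to Integer.
-- The non-matching sections (the Left values) are discarded.
allInts :: String -> Maybe [Integer]
allInts = fmap rights . parseMaybe (sepCap signedInt)
```

With these definitions, `allInts "a -12 b +7 c 300"` should evaluate to `Just [-12,7,300]`.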
Regular expressions are only able to pattern-match regular grammars. Parsers are able to pattern-match context-free grammars, and even context-sensitive or Turing-complete grammars, if needed. See below for an example of lifting a State monad for context-sensitive pattern-matching.
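Balanced parentheses are a standard example of a context-free pattern that a regular grammar cannot describe. The following is a sketch only: `parens` and `balancedGroups` are hypothetical names, and the parser assumes that any character other than a parenthesis may appear inside a group.

```haskell
import Data.Either (rights)
import Data.Functor (void)
import Data.Void (Void)
import Replace.Megaparsec (sepCap)
import Text.Megaparsec (Parsec, match, noneOf, parseMaybe, skipMany, (<|>))
import Text.Megaparsec.Char (char)

type Parser = Parsec Void String

-- One balanced group: '(' then any mix of nested groups and
-- non-parenthesis characters, then the closing ')'.
parens :: Parser ()
parens = char '(' *> skipMany (parens <|> void (noneOf "()")) <* char ')'

-- The text of every top-level balanced group in the input.
-- An opening parenthesis that is never closed is not a match.
balancedGroups :: String -> Maybe [String]
balancedGroups = fmap rights . parseMaybe (sepCap (fst <$> match parens))
```

With these definitions, `balancedGroups "f(a(b)c) + g(x"` should evaluate to `Just ["(a(b)c)"]`, because the final `(x` is never closed.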
The replacement expression for a traditional regular expression-based substitution command is usually just a simple string template in which the Nth “capture group” can be inserted with the syntax \N. With this library, instead of a template, we get an editor function which can perform any computation, including IO.
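Two small sketches of what an editor function can do that a string template cannot. The first does arithmetic on the parsed value; the second runs the editor in a State monad (from the mtl package) to number each match in order. `doubleNumbers` and `numberMarkers` are hypothetical names for illustration; `streamEdit` and `streamEditT` are the functions provided by this library.

```haskell
import Control.Monad.State.Strict (State, evalState, get, put)
import Data.Void (Void)
import Replace.Megaparsec (streamEdit, streamEditT)
import Text.Megaparsec (Parsec, ParsecT, chunk)
import Text.Megaparsec.Char.Lexer (decimal)

-- Pure editor: replace every decimal number with twice its value.
doubleNumbers :: String -> String
doubleNumbers = streamEdit number (show . (* 2))
  where
    number :: Parsec Void String Integer
    number = decimal

-- Monadic editor: replace every "#" with a running counter starting at 1.
numberMarkers :: String -> String
numberMarkers input = evalState (streamEditT marker editor input) 1
  where
    marker :: ParsecT Void String (State Int) String
    marker = chunk "#"

    -- The matched text is ignored; only the counter matters.
    editor :: String -> State Int String
    editor _ = do
      n <- get
      put (n + 1)
      pure (show n)
```

With these definitions, `doubleNumbers "3 apples and 12 pears"` should give `"6 apples and 24 pears"`, and `numberMarkers "# a, # b, # c"` should give `"1 a, 2 b, 3 c"`.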
Try the examples in
cabal v2-repl in the
The examples depend on these imports.
    import Data.Void
    import Replace.Megaparsec
    import Text.Megaparsec
    import Text.Megaparsec.Char
    import Text.Megaparsec.Char.Lexer
sepCap family of parser combinators
The following examples show how to match a pattern to a string of text and deconstruct the string of text by separating it into sections which match the pattern, and sections which don’t match.
Pattern match, capture only the parsed result
Separate the input string into sections which can be parsed as a hexadecimal number with a prefix "0x", and sections which can’t.
    let hexparser = chunk "0x" >> hexadecimal :: Parsec Void String Integer
    parseTest (sepCap hexparser) "0xA 000 0xFFFF"

    [Right 10,Left " 000 ",Right 65535]
Pattern match, capture only the matched text
Just get the string sections which match the hexadecimal parser, and throw away the parsed number.

    let hexparser = chunk "0x" >> hexadecimal :: Parsec Void String Integer
    parseTest (findAll hexparser) "0xA 000 0xFFFF"

    [Right "0xA",Left " 000 ",Right "0xFFFF"]
Pattern match, capture the matched text and the parsed result
Capture the parsed hexadecimal number, as well as the string section which parses as a hexadecimal number.
    let hexparser = chunk "0x" >> hexadecimal :: Parsec Void String Integer
    parseTest (findAllCap hexparser) "0xA 000 0xFFFF"

    [Right ("0xA",10),Left " 000 ",Right ("0xFFFF",65535)]
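Results of the same shape can also be obtained by combining `sepCap` with Megaparsec's `match` combinator, which pairs a parser's result with the text it consumed. This is a sketch for comparison, not a statement about how `findAllCap` is implemented; `hexMatches` is a hypothetical name.

```haskell
import Data.Void (Void)
import Replace.Megaparsec (sepCap)
import Text.Megaparsec (Parsec, chunk, match, parseMaybe)
import Text.Megaparsec.Char.Lexer (hexadecimal)

type Parser = Parsec Void String

-- A hexadecimal number with a "0x" prefix.
hexparser :: Parser Integer
hexparser = chunk "0x" >> hexadecimal

-- Right carries (matched text, parsed value); Left carries unmatched text.
hexMatches :: String -> Maybe [Either String (String, Integer)]
hexMatches = parseMaybe (sepCap (match hexparser))
```

With these definitions, `hexMatches "0xA 000 0xFFFF"` should evaluate to `Just [Right ("0xA",10),Left " 000 ",Right ("0xFFFF",65535)]`.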
Pattern match, capture only the locations of the matched patterns
Find all of the sections of the stream which match the Text.Megaparsec.Char.space1 parser (a string of whitespace). Print a list of the offsets of the beginning of every pattern match.
    import Data.Either
    let spaceoffset = getOffset <* space1 :: Parsec Void String Int
    parseTest (return . rights =<< sepCap spaceoffset) " a b "
Edit text strings by running parsers with streamEdit
The following examples show how to search for a pattern in a string of text and then edit the string of text to substitute in some replacement text for the matched patterns.
Pattern match and replace with a constant
Replace all carriage-return-newline instances with newline.
    streamEdit (chunk "\r\n") (const "\n") "1\r\n2\r\n"
Pattern match and edit the matches
Replace alphabetic characters with the next character in the alphabet.
    streamEdit (some letterChar) (fmap succ) "HAL 9000"
Pattern match and maybe edit the matches, or maybe leave them alone
Find all of the string sections s which can be parsed as a hexadecimal number r, and if r≤16, then replace s with a decimal number. Uses the match combinator.

    let hexparser = chunk "0x" >> hexadecimal :: Parsec Void String Integer
    streamEdit (match hexparser) (\(s,r) -> if r <= 16 then show r else s) "0xA 000 0xFFFF"

    "10 000 0xFFFF"
Context-sensitive pattern match and edit the matches
Capitalize the third letter in a string. The capthird parser searches for individual letters, and it needs to remember how many times it has run so that it can match successfully only on the third time that it finds a letter. To enable the parser to remember how many times it has run, we’ll compose the parser with a State monad from the mtl package. (Run in cabal v2-repl -b mtl).
    import qualified Control.Monad.State.Strict as MTL
    import Control.Monad.State.Strict (get, put, evalState)
    import Data.Char (toUpper)

    let capthird :: ParsecT Void String (MTL.State Int) String
        capthird = do
            x <- letterChar
            i <- get
            put (i+1)
            if i==3 then return [x] else empty

    flip evalState 1 $ streamEditT capthird (return . fmap toUpper) "a a a a a"

    "a a A a a"
I wanted to scan a Markdown document and find tokens inside backticks that look like a Haskell identifier, then look up the identifier in Hoogle to see if it has a definition in base, and if so, insert a Hackage link for the identifier into the Markdown. I couldn’t find a simple and obvious way to do that with any existing technology.
Revision history for replace-megaparsec
220.127.116.11 – 2019-08-24
- First version. Megaparsec.