5.5.1 Regular Expression Language
A regexp pattern with rx is written between '…' using a language that is distinct from Rhombus’s normal expression language. A pair of '…' does not form a string, but instead groups a set of shrubbery terms, and the rx form interprets those terms differently than Rhombus expressions, even though they use some of the same operators.
For example,
rx'.*'
is a regular expression that matches any number of repetitions of any non-newline character, since the . operator matches any non-newline character, and the * operator combines with the preceding pattern to match any number (zero or more) repetitions of that pattern.
More precisely, .* is a single shrubbery operator, but .* is defined as an alias of . followed by *.
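As a minimal sketch of the .* pattern described above, the following module tries it on a few inputs. The import line is an assumption about how rx is brought into scope, and the input strings are made up for illustration.

```rhombus
#lang rhombus
import: rhombus/rx open  // assumed import path for rx

// .* matches any run of non-newline characters, including an empty one
rx'.*'.match("hello world")  // prints RXMatch("hello world", [], {})
rx'.*'.match("")             // prints RXMatch("", [], {})
// match must cover the whole input, and . does not match a newline
rx'.*'.match("two\nlines")   // prints #false
```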
Literal characters are written as strings, so the regexp
rx'"a"* "b"*'
matches any number of "a"s followed by any number of "b"s. Juxtaposed patterns, such as "a"* followed by "b"*, match an input that is a match to the first pattern followed by a match to the second.
RXMatch("aaabb", [], {})
#false
#false
> rx'"a"* "b"*'.match_in("xx") // matches 0 repetitions at start
RXMatch("", [], {})
The + and ? operators have their conventional meanings of “one or more repetitions” and “zero or one repetitions.”
RXMatch("aaabb", [], {})
RXMatch("aaabb", [], {})
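To make the difference between + and ? concrete, here is a small sketch that combines them in one pattern. The import line and the input strings are assumptions chosen for illustration, not taken from the examples above.

```rhombus
#lang rhombus
import: rhombus/rx open  // assumed import path for rx

// an optional "a", then one or more "b"s
rx'"a"? "b"+'.match("abb")  // prints RXMatch("abb", [], {})
rx'"a"? "b"+'.match("bb")   // prints RXMatch("bb", [], {}), since ? allows zero "a"s
rx'"a"? "b"+'.match("a")    // prints #false, since + needs at least one "b"
rx'"a"? "b"+'.match("aab")  // prints #false, since ? allows at most one "a"
```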
A string matches a sequence of characters. To match any character in a set, use […] to create the set, and list characters in the set as strings. The - operator in a character set adds an inclusive range of characters to the set.
#false
RXMatch("ba", [], {})
RXMatch("ba", [], {})
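A character range can be sketched as follows. Only the range form with - is used here; the import line and the inputs are assumptions for illustration.

```rhombus
#lang rhombus
import: rhombus/rx open  // assumed import path for rx

// ["a"-"c"] matches one character in the inclusive range a through c
rx'["a"-"c"]+'.match("cab")  // prints RXMatch("cab", [], {})
rx'["a"-"c"]+'.match("cad")  // prints #false, since "d" is outside the range
```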
A character-set pattern is a special case of pattern alternatives, but the || operator supports alternatives more generally. A pattern formed with || matches when the pattern to its left matches or the pattern to its right matches. The precedence of || is weaker than juxtaposition. Parentheses are simply grouping forms, and they do not imply capture groups as in some regexp notations.
RXMatch("aaabb", [], {})
RXMatch("xxyyy", [], {})
#false
RXMatch("abz", [], {})
RXMatch("xyz", [], {})
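The precedence rule can be checked with a short sketch: without parentheses, || separates two whole juxtaposed sequences, while parentheses limit the alternatives to a prefix. The patterns, inputs, and import line below are assumptions chosen for illustration.

```rhombus
#lang rhombus
import: rhombus/rx open  // assumed import path for rx

// parsed as ("a"+ "b") || ("x"+ "y"), because || is weaker than juxtaposition
rx'"a"+ "b" || "x"+ "y"'.match("aab")  // prints RXMatch("aab", [], {})
rx'"a"+ "b" || "x"+ "y"'.match("xxy")  // prints RXMatch("xxy", [], {})
rx'"a"+ "b" || "x"+ "y"'.match("aay")  // prints #false

// parentheses only group; the empty list shows that no capture group is created
rx'("a" || "x") "y"'.match("xy")       // prints RXMatch("xy", [], {})
```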
The $ operator creates a capture group when it is followed by an identifier and a : block containing a pattern. It matches the same as the pattern in the block, but it also associates that match with the specified identifier, which is useful when the $ pattern is part of a larger pattern. The RXMatch result of RX.match can be indexed in the same way as a map (see Maps) using the symbol form of the identifier, which extracts the matching portion of the input. Alternatively, a match can be indexed by the position of the capture group, where index 0 corresponds to the whole pattern.
RXMatch("xyz", ["xy"], {#'prefix: 1})
> def m = rx'($prefix: "a"+ "b"+ || "x"+ "y"+) "z"'.match("xyz")
> m[#'prefix]
"xy"
> m[0]
"xyz"
> m[1]
"xy"
When $ is followed by an identifier without a subsequent : block, then it is a backreference to a capture group. The backreference matches input that is exactly the same as the captured match.
Backreferences extend the matching capability of regexps beyond the automata-theory definition of regular expressions, but backreference support is commonly included in regexp libraries anyway.
> rx'($prefix: "a"+ "b"+ || "x"+ "y"+) $prefix'.match("xyxy")
RXMatch("xyxy", ["xy"], {#'prefix: 1})
> rx'($prefix: "a"+ "b"+ || "x"+ "y"+) $prefix'.match("xyxxy")
#false
The $ operator supports backreferences by integer index, too. The ~~ operator creates an anonymous capture group that can only be referenced by index.
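A rough sketch of an anonymous group with an integer backreference follows. It assumes that ~~ is written as a prefix operator before a parenthesized pattern and that $1 refers to the first capture group; both the exact syntax and the import line are assumptions to check against the quick reference.

```rhombus
#lang rhombus
import: rhombus/rx open  // assumed import path for rx

// ~~ is assumed here to be a prefix operator on a parenthesized pattern
def m = rx'~~("a"+) "-" $1'.match("aa-aa")
m[1]  // the anonymous group's match, expected to be "aa"

// the backreference must repeat the captured text exactly
rx'~~("a"+) "-" $1'.match("aa-a")  // expected to print #false
```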
The $ operator supports one more overloading: when not followed by an identifier or immediate integer, it serves as an escape back to Rhombus expressions. The result of the expression must be a regexp object that can be spliced into the enclosing pattern.
RXMatch("1 through 99", [], {})
> rx'$num_rx ", " $num_rx ", and " $num_rx'.match("1, 10, and 100")
#false
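The escape form can be sketched with a small reusable pattern. The name digits_rx is hypothetical and is not the num_rx used above; the import line and inputs are likewise assumptions for illustration.

```rhombus
#lang rhombus
import: rhombus/rx open  // assumed import path for rx

// hypothetical helper: one or more decimal digits
def digits_rx = rx'["0"-"9"]+'

// $digits_rx escapes to a Rhombus expression whose regexp value is spliced in
rx'$digits_rx "-" $digits_rx'.match("1999-2024")  // prints RXMatch("1999-2024", [], {})
rx'$digits_rx "-" $digits_rx'.match("1999-")      // prints #false, no digits after "-"
```

Defining the shared piece once keeps the larger pattern short and ensures both occurrences match the same way.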
We’ve only touched on a few of the regexp pattern and character set operators provided by rhombus/rx. Regexp Quick Reference provides a quick reference to the full set of exported operators. New operators can be defined outside of rhombus/rx using rx.macro or rx_charset.macro.