5.5.1 Regular Expression Language

A regexp pattern with rx is written between '' using a language that is distinct from Rhombus’s normal expression language. A pair of '' does not form a string, but instead groups a set of shrubbery terms, and the rx form interprets those terms differently than Rhombus expressions, even though they use some of the same operators.

For example,

rx'.*'

is a regular expression that matches any number of repetitions of any non-newline character, since the . operator matches any non-newline character, and the * operator combines with the preceding pattern to match any number (zero or more) repetitions of that pattern.

More precisely, .* is a single shrubbery operator, but .* is defined as an alias of . followed by *.
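As a small illustration of the description above, the following sketch applies rx'.*' to two inputs. It uses only forms that appear elsewhere in this section (the rhombus/rx import and the match method); the expected results in the comments follow from the stated meaning of . and *, together with match requiring the whole input to match.

```rhombus
#lang rhombus
import:
  rhombus/rx open

// every character is a non-newline character, so the whole input matches
rx'.*'.match("hello")  // RXMatch("hello", [], {})

// . does not match the newline, so the whole input cannot match
rx'.*'.match("a\nb")   // #false
```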

Literal characters are written as strings, so the regexp

rx'"a"* "b"*'

matches any number of "a"s followed by any number of "b"s. Juxtaposed patterns, such as "a"* followed by "b"*, match an input that is a match to the first pattern followed by a match to the second.

import:

  rhombus/rx open

> rx'"a"* "b"*'.match("aaabb")

RXMatch("aaabb", [], {})

> rx'"a"* "b"*'.match("xxaaabb")

#false

> rx'"a"* "b"*'.match("aaabbxx")

#false

> rx'"a"* "b"*'.match_in("xx") // matches 0 repetitions at start

RXMatch("", [], {})

The + and ? operators have their conventional meanings of “one or more repetitions” and “zero or one repetitions.”

> rx'"a"+ "b"+'.match("aaabb")

RXMatch("aaabb", [], {})

> rx'"a"+ "b"+'.match_in("xxaaabbxx")

RXMatch("aaabb", [], {})
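The examples above exercise only +. As a sketch of the ? operator, using the same string patterns and the match method shown in this section, an optional leading "a" might look like this:

```rhombus
#lang rhombus
import:
  rhombus/rx open

// "a"? accepts zero or one "a" before the run of "b"s
rx'"a"? "b"+'.match("bb")    // RXMatch("bb", [], {})
rx'"a"? "b"+'.match("abb")   // RXMatch("abb", [], {})

// two "a"s are one too many for ?
rx'"a"? "b"+'.match("aabb")  // #false
```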

A string matches a sequence of characters. To match any character in a set, use [] to create the set, and list characters in the set as strings. The - operator in a character set adds an inclusive range of characters to the set.

> rx'"a" "b"+'.match("ba")

#false

> rx'["a" "b"]+'.match("ba")

RXMatch("ba", [], {})

> rx'["a"-"z"]+'.match("ba")

RXMatch("ba", [], {})

A character-set pattern is a special case of pattern alternatives, but the || operator supports alternatives more generally. A pattern formed with || matches when the pattern to its left matches or the pattern to its right matches. The precedence of || is weaker than juxtaposition. Parentheses are simply grouping forms, and they do not imply capture groups as in some regexp notations.

> rx'"a"+ "b"+ || "x"+ "y"+'.match("aaabb")

RXMatch("aaabb", [], {})

> rx'"a"+ "b"+ || "x"+ "y"+'.match("xxyyy")

RXMatch("xxyyy", [], {})

> rx'"a"+ "b"+ || "x"+ "y"+'.match("xxbbb")

#false

> rx'("a"+ "b"+ || "x"+ "y"+) "z"'.match("abz")

RXMatch("abz", [], {})

> rx'("a"+ "b"+ || "x"+ "y"+) "z"'.match("xyz")

RXMatch("xyz", [], {})

The $ operator creates a capture group when it is followed by an identifier and : block containing a pattern. It matches the same as the pattern in the block, but it also associates that match with the specified identifier, which is useful when the $ pattern is part of a larger pattern. The RXMatch result of RX.match can be indexed in the same way as a map (see Maps) using the symbol form of the identifier, which extracts the matching portion of the input. Alternatively, a match can be indexed by position of the capture group, where index 0 corresponds to the whole pattern.

> rx'($prefix: "a"+ "b"+ || "x"+ "y"+) "z"'.match("xyz")

RXMatch("xyz", ["xy"], {#'prefix: 1})

> def m = rx'($prefix: "a"+ "b"+ || "x"+ "y"+) "z"'.match("xyz")

> m[#'prefix]

"xy"

> m[0]

"xyz"

> m[1]

"xy"

When $ is followed by an identifier without a subsequent : block, then it is a backreference to a capture group. The backreference matches input that is exactly the same as the captured match.

Backreferences extend the matching capability of regexps beyond the automata-theory definition of regular expressions, but backreference support is commonly included in regexp libraries anyway.

> rx'($prefix: "a"+ "b"+ || "x"+ "y"+) $prefix'.match("xyxy")

RXMatch("xyxy", ["xy"], {#'prefix: 1})

> rx'($prefix: "a"+ "b"+ || "x"+ "y"+) $prefix'.match("xyxxy")

#false

The $ operator supports backreferences by integer index, too. The ~~ operator creates an anonymous capture group that can only be referenced by index.
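The following is a minimal sketch of both forms, not taken from the original text. It assumes that $1 refers to the first capture group (consistent with the index 1 reported for prefix in the earlier results) and that ~~ is written as a prefix before the pattern it captures; the names by_index and anonymous are hypothetical. Expected outputs are left out because these forms are not demonstrated elsewhere in this section.

```rhombus
#lang rhombus
import:
  rhombus/rx open

// named capture group, referenced by index instead of by name
def by_index = rx'($prefix: "a"+ "b"+ || "x"+ "y"+) $1'
by_index.match("xyxy")    // should match: the second half repeats the captured "xy"
by_index.match("xyxxy")   // should not match

// assumption: ~~ is a prefix operator, so this group has no name
def anonymous = rx'~~("a"+ "b"+ || "x"+ "y"+) $1'
anonymous.match("abab")   // should match for the same reason
```

If the assumptions hold, the first pattern should behave like the $prefix backreference example shown earlier, and the second should differ only in that its match has no named entry for the group.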

The $ operator supports one more overloading: when not followed by an identifier or immediate integer, it serves as an escape back to Rhombus expressions. The result of the expression must be a regexp object that can be spliced into the enclosing pattern.

def num_rx = rx'["1"-"9"]* ["0"-"9"]'

> rx'$num_rx " through " $num_rx'.match("1 through 99")

RXMatch("1 through 99", [], {})

> rx'$num_rx ", " $num_rx ", and " $num_rx'.match("1, 10, and 100")

#false
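The #false result follows from how num_rx is written: after any run of nonzero digits it accepts exactly one final digit, so "100" is rejected at its second "0". As a sketch only, a variant that puts the nonzero digit first and lets any digits follow would accept the longer input. The name multi_digit_rx is hypothetical and is not part of the original example.

```rhombus
#lang rhombus
import:
  rhombus/rx open

// hypothetical variant: one nonzero digit, then any number of digits
def multi_digit_rx = rx'["1"-"9"] ["0"-"9"]*'

rx'$multi_digit_rx ", " $multi_digit_rx ", and " $multi_digit_rx'.match("1, 10, and 100")
// RXMatch("1, 10, and 100", [], {})
```

Unlike num_rx, this variant does not match a lone "0", so the choice between the two depends on which inputs should count as numbers.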

We’ve only touched on a few of the regexp pattern and character set operators provided by rhombus/rx. Regexp Quick Reference provides a quick reference to the full set of exported operators. New operators can be defined outside of rhombus/rx using rx.macro or rx_charset.macro.