5.5.1 Regular Expression Language

8.17.0.6

A regexp pattern with rx is written between '…' using a language that is distinct from Rhombus’s normal expression language. A pair of '…' does not form a string, but instead groups a set of shrubbery terms, and the rx form interprets those terms differently than Rhombus expressions, even though they use some of the same operators.

For example,

rx'.*'

is a regular expression that matches any number of repetitions of any non-newline character, since the . operator matches any non-newline character, and the * operator combines with the preceding pattern to match any number (zero or more) repetitions of that pattern.

More precisely, .* is a single shrubbery operator, but .* is defined as an alias of . followed by *.
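As a small sketch of the . and * operators in action (the input strings are chosen only for illustration), a whole-string match with rx'.*' should succeed on single-line input and fail once the input contains a newline:

```rhombus
#lang rhombus
import:
  rhombus/rx open

// every character of "hello" is a non-newline character,
// so the whole string matches
rx'.*'.match("hello")

// the "\n" in the middle is not matched by `.`, so the
// whole-string match is expected to produce #false
rx'.*'.match("hello\nworld")
```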

Literal characters are written as strings, so the regexp

rx'"a"* "b"*'

matches any number of "a" characters followed by any number of "b" characters. Juxtaposed patterns, such as "a"* followed by "b"*, match an input that is a match to the first pattern followed by a match to the second.

import:
  rhombus/rx open

> rx'"a"* "b"*'.match("aaabb")
RXMatch("aaabb", [], {})
> rx'"a"* "b"*'.match("xxaaabb")
#false
> rx'"a"* "b"*'.match("aaabbxx")
#false
> rx'"a"* "b"*'.match_in("xx") // matches 0 repetitions at start
RXMatch("", [], {})

The + and ? operators have their conventional meanings of “one or more repetitions” and “zero or one repetitions.”

> rx'"a"+ "b"+'.match("aaabb")
RXMatch("aaabb", [], {})
> rx'"a"+ "b"+'.match_in("xxaaabbxx")
RXMatch("aaabb", [], {})
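The examples above exercise only +. A comparable sketch for ?, using a pattern made up for illustration, shows that an optional "a" may appear once or not at all, but not twice:

```rhombus
#lang rhombus
import:
  rhombus/rx open

// zero "a"s followed by one or more "b"s: matches
rx'"a"? "b"+'.match("bb")

// exactly one "a" followed by one or more "b"s: matches
rx'"a"? "b"+'.match("abb")

// two "a"s exceed what ? allows, so the whole-string
// match is expected to produce #false
rx'"a"? "b"+'.match("aabb")
```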

A string matches a sequence of characters. To match any character in a set, use […] to create the set, and list characters in the set as strings. The - operator in a character set adds an inclusive range of characters to the set.

> rx'"a" "b"+'.match("ba")
#false
> rx'["a" "b"]+'.match("ba")
RXMatch("ba", [], {})
> rx'["a"-"z"]+'.match("ba")
RXMatch("ba", [], {})

A character-set pattern is a special case of pattern alternatives, but the || operator supports alternatives more generally. A pattern formed with || matches when the pattern to its left matches or the pattern to its right matches. The precedence of || is weaker than juxtaposition. Parentheses are simply grouping forms, and they do not imply capture groups as in some regexp notations.

> rx'"a"+ "b"+ || "x"+ "y"+'.match("aaabb")
RXMatch("aaabb", [], {})
> rx'"a"+ "b"+ || "x"+ "y"+'.match("xxyyy")
RXMatch("xxyyy", [], {})
> rx'"a"+ "b"+ || "x"+ "y"+'.match("xxbbb")
#false
> rx'("a"+ "b"+ || "x"+ "y"+) "z"'.match("abz")
RXMatch("abz", [], {})
> rx'("a"+ "b"+ || "x"+ "y"+) "z"'.match("xyz")
RXMatch("xyz", [], {})
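To illustrate the point that a character set is a special case of alternatives, the following sketch writes the same one-character choice both ways. The parenthesized form is an illustrative rewrite, not an idiom the library prescribes, and both patterns are expected to match the same input without creating a capture group:

```rhombus
#lang rhombus
import:
  rhombus/rx open

// character-set form: one or more characters drawn from {a, b}
rx'["a" "b"]+'.match("ba")

// alternative form: parentheses only group, so + applies to
// the whole choice between "a" and "b"
rx'("a" || "b")+'.match("ba")
```

The character-set form is the more direct way to express a choice among single characters, while || is needed once an alternative spans more than one character.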

The $ operator creates a capture group when it is followed by an identifier and a : block containing a pattern. It matches the same as the pattern in the block, but it also associates that match with the specified identifier, which is useful when the $ pattern is part of a larger pattern. The RXMatch result of RX.match can be indexed in the same way as a map (see Maps) using the symbol form of the identifier, which extracts the matching portion of the input. Alternatively, a match can be indexed by the position of the capture group, where index 0 corresponds to the whole pattern.

> rx'($prefix: "a"+ "b"+ || "x"+ "y"+) "z"'.match("xyz")
RXMatch("xyz", ["xy"], {#'prefix: 1})
> def m = rx'($prefix: "a"+ "b"+ || "x"+ "y"+) "z"'.match("xyz")
> m[#'prefix]
"xy"
> m[0]
"xyz"
> m[1]
"xy"
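A pattern can contain more than one capture group. The sketch below uses two hypothetical group names, user and host, and a made-up input to show how names and positions line up; it assumes groups are numbered left to right starting at 1, consistent with index 0 being the whole match:

```rhombus
#lang rhombus
import:
  rhombus/rx open

// two named groups separated by a literal "@"
def m = rx'($user: ["a"-"z"]+) "@" ($host: ["a"-"z"]+)'.match("ab@cd")

m[0]        // whole match: "ab@cd"
m[#'user]   // first group by name: "ab"
m[#'host]   // second group by name: "cd"
m[2]        // second group by position: "cd"
```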

When $ is followed by an identifier without a subsequent : block, then it is a backreference to a capture group. The backreference matches input that is exactly the same as the captured match.

Backreferences extend the matching capability of regexps beyond the automata-theory definition of regular expressions, but backreference support is commonly included in regexp libraries anyway.

> rx'($prefix: "a"+ "b"+ || "x"+ "y"+) $prefix'.match("xyxy")
RXMatch("xyxy", ["xy"], {#'prefix: 1})
> rx'($prefix: "a"+ "b"+ || "x"+ "y"+) $prefix'.match("xyxxy")
#false

The $ operator supports backreferences by integer index, too. The ~~ operator creates an anonymous capture group that can only be referenced by index.

The $ operator supports one more overloading: when not followed by an identifier or immediate integer, it serves as an escape back to Rhombus expressions. The result of the expression must be a regexp object that can be spliced into the enclosing pattern.

def num_rx = rx'["1"-"9"]* ["0"-"9"]'

> rx'$num_rx " through " $num_rx'.match("1 through 99")
RXMatch("1 through 99", [], {})
> rx'$num_rx ", " $num_rx ", and " $num_rx'.match("1, 10, and 100")
#false
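The last match above fails because num_rx accepts any number of nonzero digits followed by exactly one more digit, so it cannot cover the two zeros in "100". As a sketch of how a spliced pattern can be adjusted independently of the patterns that use it, the hypothetical count_rx below (not part of the original example) requires one nonzero digit followed by any number of digits:

```rhombus
#lang rhombus
import:
  rhombus/rx open

// hypothetical variant of num_rx: a nonzero leading digit,
// then zero or more digits of any kind
def count_rx = rx'["1"-"9"] ["0"-"9"]*'

// "1", "10", and "100" each fit count_rx, so the
// whole string is expected to match
rx'$count_rx ", " $count_rx ", and " $count_rx'.match("1, 10, and 100")
```

Because the enclosing pattern refers to the spliced regexp only by name, changing the definition is enough; the enclosing pattern itself stays the same apart from that name.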

We’ve only touched on a few of the regexp pattern and character set operators provided by rhombus/rx. The Regexp Quick Reference lists the full set of exported operators. New operators can be defined outside of rhombus/rx using rx.macro or rx_charset.macro.