9.1.1 Regexp Patterns

8.14.0.2

The portion of a rx or rx_in form within '…' is a pattern that is written with regexp pattern operators. Some pattern operators overlap with expression operators, but they have different meanings and precedence in a pattern. For example, the pattern operator * creates a repetition pattern, instead of multiplying like the expression * operator.

regexp operator

#%literal string

regexp operator

#%literal bytes

A literal string or byte string can be used as a pattern. It matches the string’s characters or bytes literally. See also case_insensitive.

> rx'"hello"'.match("hello")
RXMatch("hello", [], {})
> rx'"hello"'.match("olleh")
#false
> rx'#"a"'.match(#"a")
RXMatch(Bytes.copy(#"a"), [], {})

regexp operator

pat #%juxtapose pat

regexp operator

pat ++ pat

regexp operator

pat #%call (pat)

Patterns that are adjacent in a larger pattern match in sequence. The ++ operator can be used to make sequencing explicit. An implicit #%call form is treated like #%juxtapose, consistent with implicit uses of parentheses for grouping as handled by #%parens.

> rx'"hello" " " "world"'.match("hello world")
RXMatch("hello world", [], {})
> rx'"hello" ++ " " ++ "world"'.match("hello world")
RXMatch("hello world", [], {})
> rx'"hello"
       ++ " "
       ++ "world"'.match("hello world")
RXMatch("hello world", [], {})
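
As a further illustration (a sketch that is not part of the original examples, assuming the same REPL setting as above), a parenthesized pattern written immediately after another pattern forms an implicit #%call, which should sequence the two patterns just as juxtaposition does:

```rhombus
> rx'"a"("b" || "c")'.match("ac")
RXMatch("ac", [], {})
> rx'"a"("b" || "c")'.match("ad")
#false
```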

regexp operator

pat || pat

Matches as either the first pat or second pat. The first pat is tried first.

> rx'"a" || "b"'.match("a")
RXMatch("a", [], {})
> rx'"a" || "b"'.match("b")
RXMatch("b", [], {})
> rx'"a" || "b"'.match("c")
#false

regexp operator

#%parens (pat)

A parenthesized pattern is equivalent to the pat inside the parentheses. That is, parentheses are just for grouping and resolving precedence mismatches. See $ for information about capture groups, which are not implicitly created by parentheses (as they are in some traditional regexp languages).

> rx'"a" || "b" ++ "c"'.match("ac")
#false
> rx'("a" || "b") ++ "c"'.match("ac")
RXMatch("ac", [], {})

regexp operator

#%brackets [charset]

regexp operator

pat #%index [charset]

A […] pattern, which is an implicit use of #%brackets, matches a single character or byte, where charset determines the matching characters or bytes. An implicit #%index form (see Implicit Forms) is treated as a sequence of a pat and #%brackets.

See Regexp Character Sets for character set forms that can be used in charset.

> rx'["a"-"z"]'.match("m")
RXMatch("m", [], {})
> rx'["a"-"z"]'.match("0")
#false
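
A sketch of the implicit #%index form, assuming the same REPL setting as the examples above: brackets written immediately after a pattern are treated as that pattern followed by a character-set match, so the results below are what the description implies.

```rhombus
> rx'"x"["a"-"z"]'.match("xm")
RXMatch("xm", [], {})
> rx'"x"["a"-"z"]'.match("x0")
#false
```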

regexp operator

pat *

regexp operator

pat * mode

mode

~greedy

~nongreedy

~possessive

Matches a sequence of 0 or more matches to pat.

> rx'any*'.match("abc")
RXMatch("abc", [], {})
> rx'any*'.match("")
RXMatch("", [], {})

By default, the match uses ~greedy mode, where a larger number of matches is tried first—but subsequent patterns may cause backtracking to a shorter match. In ~nongreedy mode, shorter matches are tried first. The ~possessive mode is like ~greedy, but without backtracking (i.e., the longest match must succeed overall for the enclosing pattern); see also cut.

> rx'($head: any*) ($tail: any*)'.match("abc")
RXMatch("abc", ["abc", ""], {#'head: 1, #'tail: 2})
> rx'($head: any* ~nongreedy) ($tail: any*)'.match("abc")
RXMatch("abc", ["", "abc"], {#'head: 1, #'tail: 2})

> rx'any* ~greedy "z"'.match("abcz")
RXMatch("abcz", [], {})
> rx'any* ~possessive "z"'.match("abcz")
#false

regexp operator

pat +

regexp operator

pat + mode

Like *, but matches 1 or more instances of pat.

> rx'any+'.match("abc")
RXMatch("abc", [], {})
> rx'any+'.match("")
#false

regexp operator

pat ?

regexp operator

pat ? mode

Similar to *, but matches 0 or 1 instances of pat.

> rx'any?'.match("a")
RXMatch("a", [], {})
> rx'any?'.match("")
RXMatch("", [], {})
> rx'any?'.match("abc")
#false
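
The mode forms of + and ? follow the same rules described for *. The following sketch (modeled on the head/tail example for *, not taken from the original reference) shows how ~nongreedy changes which part of the input each capture group receives:

```rhombus
> rx'($head: any+ ~nongreedy) ($tail: any*)'.match("abc")
RXMatch("abc", ["a", "bc"], {#'head: 1, #'tail: 2})
> rx'($head: any?) ($tail: any*)'.match("abc")
RXMatch("abc", ["a", "bc"], {#'head: 1, #'tail: 2})
> rx'($head: any? ~nongreedy) ($tail: any*)'.match("abc")
RXMatch("abc", ["", "abc"], {#'head: 1, #'tail: 2})
```

With +, even the nongreedy head must consume at least one character, while a nongreedy ? prefers to consume none.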

regexp operator

pat #%comp {count}

regexp operator

pat #%comp {min ..}

regexp operator

pat #%comp {min .. max}

Using {…} after a pattern, which is a use of the implicit #%comp form, specifies a repetition like * or + more generally. If a single count is provided, it specifies an exact number of repetitions. If just min is provided, then it specifies a minimum number of repetitions, and there is no maximum. Finally, min and max both can be specified. A count, min, or max must be a literal nonnegative integer.

> rx'any{2}'.match("aa")
RXMatch("aa", [], {})
> rx'any{2}'.match("aaa")
#false

> rx'any{2..}'.match("aa")
RXMatch("aa", [], {})
> rx'any{2..}'.match("aaa")
RXMatch("aaa", [], {})

> rx'any{2..3}'.match("aa")
RXMatch("aa", [], {})
> rx'any{2..3}'.match("aaa")
RXMatch("aaa", [], {})
> rx'any{2..3}'.match("aaaa")
#false

regexp operator

.

regexp operator

any

regexp operator

char

regexp operator

byte

Matches a single character or byte. The . or any patterns are equivalent, and they do not match a newline character unless they are used under enable_newline. The char and byte patterns match any character or byte, including a newline, and also imply that the enclosing regexp matches strings or byte strings, respectively.

> rx'.'.match("a")
RXMatch("a", [], {})
> rx'.'.match("\n")
#false
> rx'enable_newline: .'.match("\n")
RXMatch("\n", [], {})

> rx'char'.match("\n")
RXMatch("\n", [], {})
> rx'byte'.match("\n")
RXMatch(Bytes.copy(#"\n"), [], {})

regexp operator

.* mode

regexp operator

.? mode

Equivalent to . * and . ?, but allowing the space between the operators to be omitted.

> rx'.*'.match("abc")
RXMatch("abc", [], {})
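
Since .? and .* are described as equivalent to any? and any*, they should behave like the earlier any examples. A brief sketch under that assumption, including a mode after .*:

```rhombus
> rx'.?'.match("a")
RXMatch("a", [], {})
> rx'.?'.match("ab")
#false
> rx'($head: .* ~nongreedy) ($tail: .*)'.match("abc")
RXMatch("abc", ["", "abc"], {#'head: 1, #'tail: 2})
```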

regexp operator

^

regexp operator

bof

Matches the start of input or, in the case of ^ when not under enable_newline, the position after a newline. The bof operator always matches the beginning of input and is not affected by enable_newline.

A regexp created with rx (as opposed to rx_in) is implicitly prefixed with bof for use with methods like Regexp.match (as opposed to Regexp.match_in).

> rx'^ "a"'.match_in("a")
RXMatch("a", [], {})
> rx'^ "a"'.match_in("xa")
#false
> rx'^ "a"'.match_in("x\na")
RXMatch("a", [], {})
> rx'bof "a"'.match_in("x\na")
#false
> rx'enable_newline: ^ "a"'.match_in("x\na")
#false

regexp operator

$

regexp operator

eof

regexp operator

$ identifier: pat

regexp operator

$ identifier

regexp operator

$ int

regexp operator

$ expr

The $ operator is overloaded for multiple uses:

When not followed by anything, $ matches the end of input or the position before a newline, analogous to the way that ^ matches the start of input or the position after a newline. The eof operator matches the end of input.

> rx'"a" $'.match_in("a")
RXMatch("a", [], {})
> rx'"a" $'.match_in("az")
#false

When followed by an identifier and a : for a block containing pat, $ creates a capture group. The portion of input that is matched against pat is recorded and associated with the name identifier. If the enclosing pattern uses pat zero or multiple times, then identifier is associated to #false if the pattern is used zero times, or it is associated to the latest match if used multiple times.

> rx'any ($m: any)'.match("ab")
RXMatch("ab", ["b"], {#'m: 1})
> rx'any ($m: any)'.match("ab")[#'m]
"b"
> rx'any ($m: any)*'.match("a")
RXMatch("a", [#false], {#'m: 1})
> def rx'any ($m: any)' = "ab"
> m
"b"

When followed by an identifier and no subsequent block, then $ is either a backreference to a named capture group, or it is a splice of a regexp that is bound to identifier.

The use of $ forms a backreference if identifier is associated to a capture group anywhere in the enclosing pattern; the backreference matches input that is the same as the most recent match for the capture group (and never matches if the capture group does not yet have a match).

> rx'any ($m: any) $m'.match("abb")
RXMatch("abb", ["b"], {#'m: 1})
> rx'any ($m: any) $m'.match("abc")
#false

When $ forms a splice, then a regular expression is formed dynamically by merging the referenced regexp into the enclosing pattern. (A limitation: both the merged regexp and enclosing pattern must be free of backreferences, because backreferences need to be converted from names to absolute positions eagerly.)

fun labeled(key) :: RX:
  rx'$key ": " $name: .*'

> labeled(rx'"fruit"').match("fruit: apple")
RXMatch("fruit: apple", ["apple"], Map.by(===){#'name: 1})
> labeled(rx'"veggie"').match("veggie: carrot")
RXMatch("veggie: carrot", ["carrot"], Map.by(===){#'name: 1})

When followed by a literal integer, then $ forms a backreference that refers to a capture group by index instead of by name. Capture groups are numbered from 1, since 0 is reserved to refer to the entire match.

> rx'any ($m: any) $1'.match("abb")
RXMatch("abb", ["b"], {#'m: 1})
> rx'any ($m: any) $1'.match("abc")
#false

When followed by an expression other than an identifier or literal integer, then $ always forms a splice.
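
Returning to the first use of $ listed above, the difference between $ and eof mirrors the difference between ^ and bof. The following sketch (not from the original reference) assumes the default newline mode, in which $ can match just before a newline while eof matches only at the end of input:

```rhombus
> rx'"a" $'.match_in("a\nb")
RXMatch("a", [], {})
> rx'"a" eof'.match_in("a\nb")
#false
> rx'"b" eof'.match_in("a\nb")
RXMatch("b", [], {})
```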

regexp operator

~~ pat

Matches pat as an unnamed capture group. The capture group’s match can only be referenced by index (counting from 1).

> rx'any ~~any any*'.match("abc")[1]
"b"
> rx'any ~~any $1'.match("abb")
RXMatch("abb", ["b"], {})

regexp operator

lookahead(pat)

regexp operator

lookback(pat)

regexp operator

! lookahead(pat)

regexp operator

! lookback(pat)

Matches an empty position in the input where the subsequent (for lookahead) or preceding (for lookback) input matches pat—or does not match, when a ! prefix is used.

> rx'. "a" lookahead("p")'.match_in("cat nap")
RXMatch("na", [], {})
> rx'. "a" !lookahead("t")'.match_in("cat nap")
RXMatch("na", [], {})
> rx'lookback("n") "a" .'.match_in("cat nap")
RXMatch("ap", [], {})
> rx'!lookback("c") "a" .'.match_in("cat nap")
RXMatch("ap", [], {})

regexp operator

word_boundary

regexp operator

word_continue

Matches an empty position in the input. The word_boundary pattern matches between an alphanumeric ASCII character (a-z, A-Z, or 0-9) or _ and another character that is not alphanumeric or _. The word_continue pattern matches positions that do not match word_boundary.

> rx'any+ ~nongreedy word_boundary'.match_in("cat nap")
RXMatch("cat", [], {})
> rx'any+ ~nongreedy word_continue'.match_in("cat nap")
RXMatch("c", [], {})

regexp operator

if lookahead(pat) | then_pat | else_pat

regexp operator

if lookback(pat) | then_pat | else_pat

regexp operator

if ! lookahead(pat) | then_pat | else_pat

regexp operator

if ! lookback(pat) | then_pat | else_pat

regexp operator

if $ identifier | then_pat | else_pat

regexp operator

if $ int | then_pat | else_pat

Matches as then_pat or else_pat, depending on the form immediately after if, which must be either a lookahead, lookback, or backreference pattern.

> rx'($x: "x")* if $x | "s" | "."'.match_in("xxxs")
RXMatch("xxxs", ["x"], {#'x: 1})
> rx'($x: "x")* if $x | "s" | "."'.match_in(".")
RXMatch(".", [#false], {#'x: 1})
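
The examples above use a backreference as the condition. As a sketch of the lookahead form (an illustration added here, with results that follow from the description rather than from the original reference), the condition selects a branch without consuming input, and a failed then_pat does not fall back to else_pat:

```rhombus
> rx'if lookahead("a") | "ab" | "cd"'.match("ab")
RXMatch("ab", [], {})
> rx'if lookahead("a") | "ab" | "cd"'.match("cd")
RXMatch("cd", [], {})
> rx'if lookahead("a") | "ab" | "cd"'.match("ad")
#false
```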

regexp operator

cut

Matches an empty position in the input. The first potential match that reaches cut is the only one that is allowed to succeed. Note that a possessive repetition mode like * ~possessive is equivalent to using cut after the repetition.

In the case of a rx_in pattern or use of RX.match_in, cut applies only to a match attempt at a given input position. It does not prevent trying the match at a later position.

> rx'("ax" || "a") cut "x"'.match("ax")
#false
> rx'("a" || "ax") cut "x"'.match("ax")
RXMatch("ax", [], {})
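
To illustrate how cut interacts with match_in, here is a sketch (not from the original reference) where the attempt at the first input position is abandoned at cut, but a later position still succeeds:

```rhombus
> rx'("ax" || "a") cut "x"'.match_in("ax")
#false
> rx'("ax" || "a") cut "x"'.match_in("ax axx")
RXMatch("axx", [], {})
```

In the second case, the attempt starting at the first "a" commits to "ax" and then fails to find "x", while the attempt starting at the later "a" commits to "ax" and finds the final "x".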

regexp operator

bytes: pat

regexp operator

string: pat

Matches the same as pat, but specifies explicitly either byte-string mode or string mode.

> rx'string: "a"'.match("a")
RXMatch("a", [], {})
> rx'bytes: "a"'.match("a")
RXMatch(Bytes.copy(#"a"), [], {})
> rx'string: any'.match(#"\x80") // not UTF-8
#false
> rx'bytes: any'.match(#"\x80")
RXMatch(Bytes.copy(#"\200"), [], {})

regexp operator

case_sensitive: pat

regexp operator

case_insensitive: pat

Adjusts the treatment of literal strings and ranges in pat to match case-sensitive (the default) or case-insensitive. In case-insensitive mode, characters are folded individually (as opposed to folding a string sequence, which can change its length).

> rx'"hello"'.match("HELLO")
#false
> rx'case_insensitive: "hello"'.match("HELLO")
RXMatch("HELLO", [], {})
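
Because each form applies only to its own pat, the two modes can presumably be nested, and case-insensitive mode also affects ranges. A sketch under those assumptions (not part of the original examples):

```rhombus
> rx'case_insensitive: ["a"-"z"]'.match("M")
RXMatch("M", [], {})
> rx'case_insensitive: "a" (case_sensitive: "b")'.match("Ab")
RXMatch("Ab", [], {})
> rx'case_insensitive: "a" (case_sensitive: "b")'.match("AB")
#false
```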

regexp operator

enable_newline: pat

regexp operator

disable_newline: pat

Adjusts the meaning of any, ^, and $, in pat:

enable_newline allows any to match newlines and causes ^ and $ to match only at the beginning and end of the input.
disable_newline prevents any from matching newlines and allows ^ and $ to match just before and after newlines. This is the default mode.

> rx'"x" any "y"'.match("x\ny")
#false
> rx'enable_newline: "x" any "y"'.match("x\ny")
RXMatch("x\ny", [], {})

> rx'^ "x" $'.match_in("a\nx\nz")
RXMatch("x", [], {})
> rx'enable_newline: ^ "x" $'.match_in("a\nx\nz")
#false

regexp operator

alpha

Each named character set, such as alpha, is bound both as a character set and as a pattern that can be used directly, instead of wrapping in […]. See the alpha, etc., character set for more information.

> rx'alpha'.match("m")
RXMatch("m", [], {})
> rx'alpha'.match("0")
#false
