8.16.0.4
5.5.3 String, Byte String, and Port Matching

A regexp produced by rx matches in either character mode or byte mode. The mode is inferred from the rx pattern. For example, if a literal string is part of the pattern, then the regexp must be in character mode, but if a literal byte string is part of the pattern, it must be in byte mode. The string and bytes forms can be used to make the choice explicit.
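As a minimal sketch of the difference between inferred and explicit modes, the definitions below use hypothetical names. They assume that rx is in scope as in the other examples in this section, and that a byte string literal is written inside a pattern with the usual #"..." syntax.

```rhombus
// Assumption: a byte string literal in a pattern uses the #"..." syntax.
// All names here are hypothetical and chosen only for illustration.

// Mode inferred from a literal in the pattern:
def inferred_char_rx = rx'"ab" .'    // string literal, so character mode
def inferred_byte_rx = rx'#"ab" .'   // byte string literal, so byte mode

// Mode chosen explicitly, for a pattern with no literal to infer from:
def explicit_char_rx = rx'string: . . .'
def explicit_byte_rx = rx'bytes: . . .'
```

The explicit forms matter most for patterns such as . . . that contain no literal to infer a mode from.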

Either mode can work with a string input to match, and either can work with a byte string input to match. In character mode, . and any match a Unicode character, so given a byte string input, they match only valid UTF-8 encoding sequences. Along similar lines, a byte-mode regexp given a string input matches against the UTF-8 encoding of the string.

def char_rx = rx'string: . . .'

def byte_rx = rx'bytes: . . .'

> char_rx.match("abc")

RXMatch("abc", [], {})

> char_rx.match("λλλ")

RXMatch("λλλ", [], {})

> byte_rx.match("abc")

RXMatch(Bytes.copy(#"abc"), [], {})

> byte_rx.match("λλλ") // six bytes in UTF-8

#false

> char_rx.match(#"abc")

RXMatch(Bytes.copy(#"abc"), [], {})

> char_rx.match(#"a\xFF\xFF") // not valid UTF-8

#false

> char_rx.match(#"\316\273\316\273\316\273")

RXMatch(Bytes.copy(#"\316\273\316\273\316\273"), [], {})

A regexp match can be applied directly to an input port, as opposed to reading bytes or strings from the port and then matching. Direct port matching is especially useful with rx_in or RX.match_in to find the first match, because bytes can be read from the port lazily to find a match, and no further bytes will be consumed after a match ends. A port is treated like a byte string for input, so even if a character-based regexp is used, results are reported in terms of bytes.

def inp = Port.Input.open_string("abcdef")

> char_rx.match_in(inp)

RXMatch(Bytes.copy(#"abc"), [], {})

> char_rx.match_in(inp)

RXMatch(Bytes.copy(#"def"), [], {})

> char_rx.match_in(inp)

#false

def inp = Port.Input.open_string("abcdef")

> rx'"d"'.match_range_in(inp)

RXMatch(3 .. 4, [], {})
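Because a port is treated like a byte string, a character-mode regexp applied to a port holding non-ASCII text still reports bytes. The following sketch combines the earlier "λλλ" byte string example with match_in. The names are hypothetical, rx is assumed to be in scope as in the examples above, and the results are left unstated because they are inferred by analogy rather than taken from the reference.

```rhombus
// Hypothetical names; rx is assumed to be in scope as in the examples above.
def three_chars = rx'string: . . .'
def greek_inp = Port.Input.open_string("λλλabc")

// Each λ occupies two bytes in UTF-8, so the first match should span six
// bytes of the port and be reported as a byte string rather than a string.
three_chars.match_in(greek_inp)

// Matching resumes after the bytes consumed by the first match, so this
// call should find the remaining three bytes, #"abc".
three_chars.match_in(greek_inp)
```

If string results are needed, one option is to read the text from the port first and match against the resulting string instead of the port.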