tl;dr
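To be safe, do not use a regex literal with =~.
Instead, use:
- either an auxiliary variable that holds the regex,
- or a command substitution that outputs the regex as a string literal.
Both must be used unquoted as the =~ RHS; see the workarounds further below.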
Whether \b and \< / \> are supported at all depends on the host platform, not Bash:
- they DO work on Linux,
- but NOT on BSD-based platforms such as macOS; there, use [[:<:]] and [[:>:]] instead, which, in the context of an unquoted regex literal, must be escaped as [[:\<:]] and [[:\>:]]; the following works as expected, but only on BSD/macOS:
[[ ' myword ' =~ [[:\<:]]myword[[:\>:]] ]] && echo YES # OK
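If one script has to run on both kinds of platform, one option is to probe at runtime which word-boundary syntax the regex library accepts. The following is only a sketch under that assumption; the variable names (probe_re, wb_start, wb_end) are made up for illustration, and the regex is kept in a variable so that the shell leaves the \ characters alone.

```shell
#!/usr/bin/env bash
# Probe: does the platform's regex library treat \< and \> as word boundaries?
# The regex lives in a variable so that the shell passes the '\' through.
probe_re='\<a\>'
if [[ 'a' =~ $probe_re ]]; then
  wb_start='\<' wb_end='\>'              # e.g. Linux
else
  wb_start='[[:<:]]' wb_end='[[:>:]]'    # e.g. BSD/macOS
fi

# Build the actual regex from the detected assertions; use it unquoted.
re="${wb_start}myword${wb_end}"
if [[ ' myword ' =~ $re ]]; then result=YES; else result=NO; fi
echo "$result"
```

Platforms that support neither syntax are not handled here; on those, the POSIX-only approach is the safer choice.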
The problem wouldn't arise - on any platform - if you limited your regex to the constructs in the POSIX ERE (extended regular expression) specification.
- Unfortunately, POSIX EREs do not support word-boundary assertions, though you can emulate them - see the last section.
- As is already the case on macOS, no \-prefixed constructs are supported, so that handy character-class shortcuts such as \s and \w aren't available either.
- However, the up-side is that such ERE-compliant regexes are then portable (they work on both Linux and macOS, for instance).
=~ is the rare case (the only case?) of a built-in Bash feature whose behavior is platform-dependent: it uses the regex libraries of the platform it is running on, resulting in different regex flavors on different platforms.
Thus, it is generally non-trivial and requires extra care to write portable code that uses the =~ operator.
Sticking with POSIX EREs is the only robust approach, which means that you have to work around their limitations - see bottom section.
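As a rough illustration of what such a workaround can look like, here is a sketch that emulates word boundaries with POSIX ERE constructs only. It assumes that a "word character" means an alphanumeric character or an underscore; adjust the bracket expression if your definition differs.

```shell
#!/usr/bin/env bash
# Portable stand-in for word boundaries, using POSIX ERE constructs only.
# Assumption: a "word character" is an alphanumeric character or '_'.
re='(^|[^[:alnum:]_])word($|[^[:alnum:]_])'

for s in ' word ' 'word' 'a word.' 'swordfish' 'words'; do
  if [[ $s =~ $re ]]; then
    echo "match:    '$s'"
  else
    echo "no match: '$s'"
  fi
done
```

Unlike a true assertion, this emulation consumes the neighboring character, so ${BASH_REMATCH[0]} may include it.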
If you want to know more, read on.
On Bash v3.2+ (unless the compat31 shopt option is set), the RHS (right-hand side operand) of the =~ operator must be unquoted in order to be recognized as a regex (if you quote the right operand, =~ matches it as a literal string instead).
More accurately, at least the special regex characters and sequences must be unquoted, so it's OK and useful to quote those substrings that should be taken literally; e.g., [[ '*' =~ ^'*' ]] matches, because ^ is unquoted and thus correctly recognized as the start-of-string anchor, whereas *, which is normally a special regex char, matches literally due to the quoting.
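To see the effect of quoting for yourself, here is a small sketch; it assumes Bash 3.2+ with the compat31 option off, and uses only POSIX ERE constructs, so it should behave the same on Linux and macOS.

```shell
#!/usr/bin/env bash
# Unquoted RHS: '.' is a regex metacharacter that matches any character.
if [[ 'axc' =~ ^a.c ]]; then unquoted=match; else unquoted='no match'; fi

# Fully quoted RHS: '.' is matched literally, so 'axc' does not match.
if [[ 'axc' =~ 'a.c' ]]; then quoted=match; else quoted='no match'; fi

# Mixed: ^ stays an anchor, while the quoted '*' is matched literally.
if [[ '*' =~ ^'*' ]]; then mixed=match; else mixed='no match'; fi

printf '%s\n' "unquoted: $unquoted" "quoted: $quoted" "mixed: $mixed"
```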
However, there appears to be a design limitation in (at least) bash 3.x that prevents use of \-prefixed regex constructs (e.g., \<, \>, \b, \s, \w, ...) in a literal =~ RHS; the limitation affects Linux, whereas BSD/macOS versions are not affected, due to fundamentally not supporting any \-prefixed regex constructs:
# Linux only:
# PROBLEM (see details further below):
# Seen by the regex engine as: <word>
# The shell eats the '\' characters before the regex engine sees them.
[[ ' word ' =~ \<word\> ]] && echo MATCHES # !! DOES NOT MATCH
# Causes syntax error, because the shell considers the < unquoted.
# If you used \\bword\\b, the regex engine would see that as-is.
[[ ' word ' =~ \\<word\\> ]] && echo MATCHES # !! BREAKS
# Using the usual quoting rules doesn't work either:
# Seen by the regex engine as: \\<word\\> instead of \<word\>
[[ ' word ' =~ \\\<word\\\> ]] && echo MATCHES # !! DOES NOT MATCH
# WORKAROUNDS
# Aux. variable.
re='\<word\>'; [[ ' word ' =~ $re ]] && echo MATCHES # OK
# Command substitution
[[ ' word ' =~ $(printf %s '\<word\>') ]] && echo MATCHES # OK
# Change option compat31, which then allows use of '...' as the RHS
# CAVEAT: Stays in effect until you reset it, may have other side effects.
# Using (...) around the command confines the effect to a subshell.
(shopt -s compat31; [[ ' word ' =~ '\<word\>' ]] && echo MATCHES) # OK
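If you need the auxiliary-variable workaround in several places, you could wrap it in a small function. This is just a sketch; matches_re is a hypothetical helper name, not something Bash provides.

```shell
#!/usr/bin/env bash
# matches_re STRING REGEX   (hypothetical helper, not a Bash built-in)
# Succeeds if STRING matches REGEX. Because the regex arrives in a variable,
# the shell leaves any '\' in it untouched.
matches_re() {
  local str=$1 re=$2
  [[ $str =~ $re ]]   # $re must stay unquoted to be treated as a regex
}

# Portable (POSIX ERE constructs only):
matches_re 'abc123' '^[a-z]+[0-9]+$' && echo 'portable regex: MATCHES'

# Linux only (relies on the platform's support for \< and \>):
matches_re ' word ' '\<word\>' && echo 'word-boundary regex: MATCHES'
```

Passing the regex in single quotes at the call site is what keeps the shell from touching the \ characters; inside the function, the variable reference stays unquoted.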
The problem:
Tip of the hat to Fólkvangr for his input.
A literal RHS of =~ is by design parsed differently from unquoted tokens used as arguments, in an attempt to allow the user to focus on escaping characters just for the regex, without also having to worry about the usual shell escaping rules in unquoted tokens.
For instance,
[[ 'a[b' =~ a\[b ]] && echo MATCHES # OK
matches, because the \ is passed through to the regex engine (that is, the regex engine too sees literal a\[b), whereas if you used the same unquoted token as a regular argument, the usual shell expansions applied to unquoted tokens would "eat" the \, because it is interpreted as a shell escape character:
$ printf %s a\[b
a[b # '\' was removed by the shell.
However, in the context of =~ this exceptional passing through of \ is only applied before characters that are regex metacharacters by themselves, as defined by the POSIX ERE (extended regular expression) specification (in order to escape them for the regex, so that they're treated as literals):
\ ^ $ [ { . ? * + ( ) |
Conversely, these regex metacharacters may exceptionally be used unquoted - and indeed must be left unquoted to have their special regex meaning - even though most of them normally require \-escaping in unquoted tokens to prevent the shell from interpreting them.
Yet, a subset of the shell metacharacters do still need escaping, for the shell's sake, so as not to break the syntax of the [[ ... ]] conditional:
& ; < > space
Since these characters aren't also regex metacharacters, there is no need to also support escaping them on the regex side, so that, for instance, the regex engine seeing \& in the RHS as just & works fine.
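The rules above can be tried out with a few literals that stick to POSIX ERE constructs. This is a minimal sketch, assuming Bash 3.2+ with default settings; the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# '\.' reaches the regex engine as-is, so it matches only a literal '.'.
if [[ 'a.b' =~ ^a\.b ]]; then dot_literal=match; else dot_literal='no match'; fi
if [[ 'axb' =~ ^a\.b ]]; then dot_other=match; else dot_other='no match'; fi

# Regex metacharacters such as ( ) + $ are left unquoted to stay special.
if [[ 'abab' =~ ^(ab)+$ ]]; then group=match; else group='no match'; fi

# Shell metacharacters (& and space here) are escaped for the shell's sake;
# the regex engine sees them without the '\'.
if [[ 'a & b' =~ ^a\ \&\ b$ ]]; then amp=match; else amp='no match'; fi

printf '%s\n' "dot_literal: $dot_literal" "dot_other: $dot_other" \
              "group: $group" "amp: $amp"
```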
For any other character preceded by \, the shell removes the \ before sending the string to the regex engine (as it does during normal shell expansion), which is unfortunate, because then even characters that the shell doesn't consider special cannot be passed as \<char> to the regex engine, because the shell invariably passes them as just <char>.
E.g., \b is invariably seen as just b by the regex engine.
It is therefore currently impossible to use a (by definition non-POSIX) regex construct in the form \<char> (e.g., \<, \>, \b, \s, \w, \d, ...) in a literal, unquoted =~ RHS, because no form of escaping can ensure that these constructs are seen by the regex engine as such, after parsing by the shell:
Since none of <, >, and b is a regex metacharacter, the shell removes the \ from \<, \>, \b (as happens in regular shell expansion). Therefore, passing \<word\>, for instance, makes the regex engine see <word>, which is not the intent:
[[ '<word>' =~ \<word\> ]] && echo YES
matches, because the regex engine sees <word>.
[[ 'boo' =~ ^\boo ]] && echo YES
matches, because the regex engine sees ^boo.
Trying \\<word\\> breaks the command, because the shell treats each \\ as an escaped \, which means that metacharacter < is then considered unquoted, causing a syntax error:
[[ ' word ' =~ \\<word\\> ]] && echo YES
causes a syntax error.
- This wouldn't happen with \\b, but \\b is passed through (due to the \ preceding a regex metachar, \), which also doesn't work:
[[ '\boo' =~ ^\\boo ]] && echo YES
matches, because the regex engine sees ^\\boo, which matches literal \boo.
Trying \\\<word\\\> - which by normal shell expansion rules results in \<word\> (try printf %s \\\<word\\\>) - also doesn't work:
What happens is that the shell eats the \ in \< (ditto for \b and other \-prefixed sequences), and then passes the preceding \\ through to the regex engine as-is (again, because \ is preserved before a regex metachar):
[[ ' \<word\> ' =~ \\\<word\\\> ]] && echo YES
matches, because the regex engine sees \\<word\\>, which matches literal \<word\>.
In short: