Capturing Parentheses
Capturing Parentheses are of the form (<regex>)
and can be used to
group arbitrarily complex regular expressions. Parentheses can be
nested, but the total number of parentheses, nested or otherwise, is
limited to 50 pairs. The text that is matched by the regular expression
between a matched set of parentheses is captured and available for text
substitutions and backreferences (see below.) Capturing parentheses
carry a fairly high overhead both in terms of memory used and execution
speed, especially if quantified by *
or +
.
Non-Capturing Parentheses
Non-Capturing Parentheses are of the form (?:<regex>)
and facilitate
grouping only and do not incur the overhead of normal capturing
parentheses. They should not be counted when determining numbers for
capturing parentheses which are used with backreferences and
substitutions. Because of the limit on the number of capturing
parentheses allowed in a regex, it is advisable to use non-capturing
parentheses when possible.
Positive Look-Ahead
Positive look-ahead constructs are of the form (?=<regex>)
and
implement a zero width assertion of the enclosed regular expression. In
other words, a match of the regular expression contained in the positive
look-ahead construct is attempted. If it succeeds, control is passed to
the next regular expression atom, but the text that was consumed by the
positive look-ahead is first unmatched (backtracked) to the place in the
text where the positive look-ahead was first encountered.
One application of positive look-ahead is the manual implementation of a first character discrimination optimization. You can include a positive look-ahead that contains a character class which lists every character that the following (potentially complex) regular expression could possibly start with. This will quickly filter out match attempts that cannot possibly succeed.
Negative Look-Ahead
Negative look-ahead takes the form (?!<regex>)
and is exactly the
same as positive look-ahead except that the enclosed regular expression
must not match. This can be particularly useful when you have an
expression that is general, and you want to exclude some special cases.
Simply precede the general expression with a negative look-ahead that
covers the special cases that need to be filtered out.
Positive Look-Behind
Positive look-behind constructs are of the form (?<=<regex>)
and
implement a zero width assertion of the enclosed regular expression in
front of the current matching position. It is similar to a positive
look-ahead assertion, except for the fact that the match is attempted on
the text preceding the current position, possibly even in front of the
start of the matching range of the entire regular expression.
A restriction on look-behind expressions is the fact that the expression
must match a string of a bounded size. In other words, *
, +
, and
{n,}
quantifiers are not allowed inside the look-behind expression.
Moreover, matching performance is sensitive to the difference between
the upper and lower bound on the matching size. The smaller the
difference, the better the performance. This is especially important for
regular expressions used in highlight patterns.
Positive look-behind has similar applications as positive look-ahead.
Negative Look-Behind
Negative look-behind takes the form (?<!<regex>)
and is exactly
the same as positive look-behind except that the enclosed regular
expression must not match. The same restrictions apply.
Note however, that performance is even more sensitive to the distance between the size boundaries: a negative look-behind must not match for any possible size, so the matching engine must check every size.
Case Sensitivity
There are two parenthetical constructs that control case sensitivity:
(?i<regex>)
Case insensitive;AbcD
andaBCd
are equivalent.(?I<regex>)
Case sensitive;AbcD
andaBCd
are different.
Regular expressions are case sensitive by default, that is,
(?I<regex>)
is assumed. All regular expression token types respond
appropriately to case insensitivity including character classes and
backreferences. There is some extra overhead involved when case
insensitivity is in effect, but only to the extent of converting each
character compared to lower case.
Matching Newlines
NEdit-ng regular expressions by default handle the matching of newlines in a way that should seem natural for most editing tasks. There are situations, however, that require finer control over how newlines are matched by some regular expression tokens.
By default, NEdit-ng regular expressions will not match a newline
character for the following regex tokens: dot .
; a negated
character class ([^...]
); and the following shortcuts for character
classes:
\d, \D, \l, \L, \s, \S, \w, \W, \Y
The matching of newlines can be controlled for the .
token, negated
character classes, and the \s
and \S
shortcuts by using one of the
following parenthetical constructs:
(?n<regex>)
.
,[^...]
,\s
,\S
match newlines(?N<regex>)
.
,[^...]
,\s
,\S
don't match newlines (default behavior)
Notes on New Parenthetical Constructs
Except for plain parentheses, none of the parenthetical constructs
capture text. If that is desired, the construct must be wrapped with
capturing parentheses, e.g. ((?i<regex>))
.
All parenthetical constructs can be nested as deeply as desired, except for capturing parentheses which have a limit of 50 sets of parentheses, regardless of nesting level.
Back References
Backreferences allow you to match text captured by a set of capturing
parenthesis at some later position in your regular expression. A
backreference is specified using a single backslash followed by a single
digit from 1 to 9 (example: \3
). Backreferences have similar syntax to
substitutions (see below), but are different from substitutions in that
they appear within the regular expression, not the substitution string.
The number specified with a backreference identifies which set of text
capturing parentheses the backreference is associated with. The text
that was most recently captured by these parentheses is used by the
backreference to attempt a match. As with substitutions, open
parentheses are counted from left to right beginning with 1. So the
backreference \3
will try to match another occurrence of the text
most recently matched by the third set of capturing parentheses. As an
example, the regular expression (\d)\1
could match 22
, 33
, or
00
, but wouldn't match 19
or 01
.
A backreference must be associated with a parenthetical expression that
is complete. The expression (\w(\1))
contains an invalid
backreference since the first set of parentheses are not complete at the
point where the backreference appears.
Substitution
Substitution strings are used to replace text matched by a set of capturing parentheses. The substitution string is mostly interpreted as ordinary text except as follows.
The escape sequences described above for special characters, and octal
and hexadecimal escapes are treated the same way by a substitution
string. When the substitution string contains the &
character,
NEdit-ng will substitute the entire string that was matched by the
Search → Find...' operation. Any of the first nine sub-expressions of the match
string can also be inserted into the replacement string. This is done by
inserting a \
followed by a digit from 1 to 9 that represents the
string matched by a parenthesized expression within the regular
expression. These expressions are numbered left-to-right in order of
their opening parentheses.
The capitalization of text inserted by &
or \1
, \2
, ... \9
can be altered by preceding them with \U
, \u
, \L
, or \l
.
\u
and \l
change only the first character of the inserted entity,
while \U
and \L
change the entire entity to upper or lower case,
respectively.