Regular Expressions
Various features in Gammadyne's programs support regular expressions. This article describes their syntax, but it is merely an overview of the most commonly used capabilities. For more detailed information, we recommend consulting the PCRE2 documentation or searching the web for "PCRE2". PCRE2 is the open-source engine that Gammadyne uses to support regular expressions. Its syntax is designed to be very close to the one that Perl supports.
Index:
Overview
Backslash
Quantifiers
Subpatterns
Character Classes
Character Properties
Assertions
Replacement Strings
Overview
When talking about regular expressions, the text that you are searching is referred to as the "subject". A regular expression is a search pattern that matches a portion of the subject. Regular expressions are particularly useful for matching repetitions and alternatives.
Most characters represent themselves in a pattern. For example, the regular expression "foo" will match the text "foo" and nothing else. It is in "metacharacters" that the true power of regular expressions is exposed:
Metacharacter | Description |
. | Match any character except newline. |
() | Part of a pattern can be enclosed in parentheses to form a "subpattern". |
\ | The backslash is a general purpose escape character with multiple uses. |
^ | Assert start of string/line. |
$ | Assert end of string/line. |
[] | Square brackets enclose a "character class". This matches an entire family of characters. |
? | Indicates that there can be 0 or 1 occurrences. Has other uses as well. |
* | Indicates 0 or more occurrences are allowed. |
+ | Indicates 1 or more occurrences are allowed. |
{} | Follows a metacharacter or subpattern to indicate the number of occurrences that are allowed, such as {3} or {2,5}. |
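As a rough illustration of a few of these metacharacters, the sketch below uses Python's built-in re module as a stand-in for PCRE2. That is an assumption made for the sake of a runnable example: Gammadyne's programs use PCRE2, not Python, although the two agree on these basic constructs. The sample strings are made up.

```python
import re

# "." matches any character except a newline, so "c\nt" is skipped.
dot_matches = re.findall(r"c.t", "cat cot c\nt cut")

# "^" and "$" anchor the pattern to the start and end of the subject.
whole_subject = re.search(r"^foo$", "foo") is not None
partial_subject = re.search(r"^foo$", "foo bar") is not None

print(dot_matches)      # ['cat', 'cot', 'cut']
print(whole_subject)    # True
print(partial_subject)  # False
```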
Backslash
The backslash serves multiple purposes. When it appears in front of a metacharacter, it takes away the metacharacter's special meaning. For example, "\+" matches a literal "+" instead of acting as the "1 or more" quantifier.
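The difference between an escaped and an unescaped metacharacter can be seen in a short sketch. It uses Python's re module as a stand-in for PCRE2 (an assumption; the escaping rule shown is the same in both), and the subject string is invented.

```python
import re

# Unescaped, "+" is a quantifier: "1+" means one or more "1" characters.
as_quantifier = re.findall(r"1+", "11+1")

# Escaped, "\+" is just a literal plus sign.
as_literal = re.findall(r"\+", "11+1")

print(as_quantifier)  # ['11', '1']
print(as_literal)     # ['+']
```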
The backslash is also useful for representing certain special characters:
Escape | Description |
\f | Form feed (code 12) |
\n | Line feed (code 10) |
\r | Carriage return (code 13) |
\t | Tab (code 9) |
\o{##} | Octal code, for example \o{101} is "A" |
\x## | Hex code |
Furthermore, the backslash can represent various character types:
Escape | Description |
\C | Any single code unit (one byte in 8-bit mode). Unlike the dot, this even matches newlines. |
\d | Digit (0-9). |
\D | Non digit. |
\h | Horizontal whitespace (spaces, tabs, etc.). |
\H | Non horizontal whitespace. |
\R | Any Unicode newline. Equivalent to "(?>\r\n|\n|\x0b|\f|\r|\x85)". |
\s | Whitespace (newlines, spaces, tabs, etc.). |
\S | Non whitespace. |
\v | Vertical whitespace (linefeeds, carriage returns, etc.). |
\V | Non vertical whitespace. |
\w | Letters, digits, and the underscore (_). |
\W | Everything but letters, digits, and the underscore. |
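A few of these character types can be tried out in the sketch below. It uses Python's re module as a stand-in for PCRE2, which is an assumption: Python supports \d, \w, and \s but not \h, \R, or \C, so only the shared escapes are shown. The subject string is invented.

```python
import re

subject = "Order 66,\taisle 7\nfoo_bar"

digits = re.findall(r"\d+", subject)  # runs of digits
words = re.findall(r"\w+", subject)   # runs of letters, digits, and underscores
pieces = re.split(r"\s+", subject)    # split wherever whitespace occurs

print(digits)  # ['66', '7']
print(words)   # ['Order', '66', 'aisle', '7', 'foo_bar']
print(pieces)  # ['Order', '66,', 'aisle', '7', 'foo_bar']
```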
Quantifiers
A quantifier is used to specify repetitions. Any character, metacharacter, subpattern, or character class can be followed by a quantifier.
The asterisk (*) quantifier allows for 0 or more repetitions. For example, "a*" matches "", "a", "aa", "aaa", etc.
The plus (+) quantifier allows for 1 or more repetitions. For example, "(ab)+" matches "ab", "abab", "ababab", etc.
The question mark (?) quantifier allows for 0 or 1 repetitions. For example, "c?" matches "" or "c".
The curly brace ({}) quantifier allows for a specific number of repetitions. It can specify a range of repetitions, such as "{2,5}", or a lower bound with no limit, such as "{3,}". For example, "d{3,4}" matches "ddd" or "dddd", and "e{4,}" matches "eeee", "eeeee", "eeeeee", etc.
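The quantifier examples above can be checked with a small sketch. Python's re module stands in for PCRE2 here (an assumption; these quantifiers behave the same in both), and matches_entirely is a hypothetical helper that tests whether a pattern matches a whole string.

```python
import re

def matches_entirely(pattern: str, text: str) -> bool:
    # True only when the pattern matches the whole text (hypothetical helper).
    return re.fullmatch(pattern, text) is not None

print(matches_entirely(r"a*", ""))           # True
print(matches_entirely(r"(ab)+", "abab"))    # True
print(matches_entirely(r"c?", "cc"))         # False
print(matches_entirely(r"d{3,4}", "dd"))     # False
print(matches_entirely(r"d{3,4}", "dddd"))   # True
print(matches_entirely(r"e{4,}", "eeeeee"))  # True
```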
Subpatterns
Part of a regular expression can be enclosed in parentheses. This is referred to as a "subpattern". This allows a series of characters to be repeated using a quantifier. For example, "(abc)+" matches "abc", "abcabc", "abcabcabc", etc.
Another important use for subpatterns is specifying alternatives using the vertical bar (|). For example, "(abc|def)" matches either "abc" or "def".
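Both uses of subpatterns can be sketched briefly. Python's re module is used as a stand-in for PCRE2 (an assumption; grouping and alternation work the same way in both), and the subject strings are invented.

```python
import re

# With parentheses, "+" repeats the whole group "abc".
repeated = re.fullmatch(r"(abc)+", "abcabcabc") is not None
# Without parentheses, "+" applies only to the "c".
unrepeated = re.fullmatch(r"abc+", "abcabcabc") is not None

# The vertical bar lets the subpattern match either alternative.
alternatives = [m.group(0) for m in re.finditer(r"(abc|def)", "abc, def, ghi")]

print(repeated)      # True
print(unrepeated)    # False
print(alternatives)  # ['abc', 'def']
```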
Character Classes
Square brackets ([]) are used to specify a "character class", which matches a single character that is a member of a set.
The simplest character class is simply a list of allowed characters. For example, "[abc]" matches either "a", "b", or "c".
It is also possible to specify a range of characters using a dash. For example, "[d-f]" matches either "d", "e", or "f".
If the first character is a circumflex (^), the character class is negated. For example, "[^aeiou]" matches any character that is not a vowel.
Character classes can include backslashed character types and properties. For example, "[\d\h\p{Lu}]" matches a digit, a horizontal whitespace character, or an upper case letter.
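The simpler character classes can be tried in the sketch below. Python's re module stands in for PCRE2, which is an assumption: Python does not support \h or \p{Lu} inside a class, so only lists, ranges, and negation are shown. The subject strings are invented.

```python
import re

listed = re.findall(r"[abc]", "cabbage")   # any of "a", "b", or "c"
ranged = re.findall(r"[d-f]", "abcdefg")   # any character from "d" to "f"
negated = re.findall(r"[^aeiou]", "regex") # anything that is not a vowel

print(listed)   # ['c', 'a', 'b', 'b', 'a']
print(ranged)   # ['d', 'e', 'f']
print(negated)  # ['r', 'g', 'x']
```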
Character Properties
Every character belongs to one or more general property classes. You can match a character in the class with "\p{property}" where property is one of the following:
Property | Description |
C | Other |
Cc | Control |
Cf | Format |
Cn | Unassigned |
Co | Private use |
Cs | Surrogate (U+D800 to U+DFFF) |
L | Letter |
Ll | Lower case letter |
Lm | Modifier letter |
Lo | Other letter: ªº |
Lt | Title case letter |
Lu | Upper case letter |
L& | Lower, upper, or title case letter |
M | Mark |
Mc | Spacing mark |
Me | Enclosing mark |
Mn | Non-spacing mark |
N | Number: 0123456789²³¹¼½¾ |
Nd | Decimal number: 0123456789 |
Nl | Letter number |
No | Other number: ²³¹¼½¾ |
P | Punctuation: !"#%&'()*,-./:;?@[\]_{}¡§«¶·»¿ |
Pc | Connector punctuation: _ |
Pd | Dash punctuation: - |
Pe | Close punctuation: )]} |
Pf | Final punctuation: » |
Pi | Initial punctuation: « |
Po | Other punctuation: !"#%&'*,./:;?@\¡§¶·¿ |
Ps | Open punctuation: ([{ |
S | Symbol: $+<=>^`|~¢£¤¥¦¨©¬®¯°±´¸×÷ |
Sc | Currency symbol: $¢£¤¥ |
Sk | Modifier symbol: ^`¨¯´¸ |
Sm | Mathematical symbol: +<=>|~¬±×÷ |
So | Other symbol: ¦©®° |
Z | Separator |
Zl | Line separator |
Zp | Paragraph separator |
Zs | Space separator |
For example, "\p{Sm}" will match any mathematical symbol, such as "+" or "=".
You can negate a property test by using a capital P instead. For example, "\P{N}" will match any non-number.
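The property codes in the table are Unicode general categories, so they can be inspected outside a regular expression. The sketch below is only an approximation under stated assumptions: Python's re module does not support "\p{...}", so the standard unicodedata module is used instead, and has_property is a hypothetical helper rather than part of PCRE2 or any Gammadyne program.

```python
import unicodedata

def has_property(char: str, prop: str) -> bool:
    # Rough stand-in for testing one character against \p{prop} (hypothetical helper).
    # A one-letter property such as "N" covers every category code that starts with it.
    # The special "L&" property is not handled here.
    return unicodedata.category(char).startswith(prop)

print(unicodedata.category("+"))  # Sm
print(has_property("=", "Sm"))    # True
print(has_property("7", "N"))     # True
print(has_property("x", "N"))     # False, so \P{N} would match "x"
print(has_property("A", "Lu"))    # True
```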
Assertions
An "assertion" specifies a condition that must be met, without consuming any characters from the subject. The basic assertions are:
Assertion | Description |
^ | Start of string/line. |
$ | End of string/line. |
\A | Start of the subject. |
\b | At a word boundary. |
\B | Not at a word boundary. |
\z | End of the subject. |
\Z | End of the subject, or before a newline at the end of the subject. |
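Some of these assertions can be sketched with Python's re module as a stand-in for PCRE2. This is an assumption made for illustration: re.MULTILINE plays the role of the engine's multi-line option, and Python's own \Z behaves differently from the \Z described here, so the end-of-subject assertions are left out. The subject strings are invented.

```python
import re

subject = "one two\nthree four"

# By default "^" only matches at the very start of the subject.
string_starts = re.findall(r"^\w+", subject)
# In multi-line mode "^" also matches right after each newline.
line_starts = re.findall(r"^\w+", subject, re.MULTILINE)
# "\A" stays tied to the start of the subject even in multi-line mode.
subject_start = re.findall(r"\A\w+", subject, re.MULTILINE)

# "\b" requires a word boundary, so the "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
# "\B" requires the opposite, so only the "cat" inside "concat" matches.
inner = re.findall(r"\Bcat", "cat concat")

print(string_starts)  # ['one']
print(line_starts)    # ['one', 'three']
print(subject_start)  # ['one']
print(whole_words)    # ['cat', 'cat']
print(inner)          # ['cat']
```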
Replacement Strings
When performing a search and replace operation, it is possible for the replacement string to make references to the matched portion of the subject. The entire match can be represented with "$0". For example, if the match is "red", and the replacement string is "dark $0", then the match will be replaced with "dark red".
It is also possible to refer to a subpattern that was enclosed in parentheses. Each subpattern is assigned an index, beginning at 1, in the order that its opening parenthesis appears from left to right. For example, "$1" refers to the first subpattern, "$2" the second, and so on.
Consider the search string "((black|blue) (Mustang|Camaro))" and the replacement string "$1,$2,$3". If the subject is "The blue Mustang is fast.", the match is "blue Mustang", and it is replaced with "blue Mustang,blue,Mustang". After the search and replace, the subject would therefore change to "The blue Mustang,blue,Mustang is fast.". $1 refers to the outermost parentheses, which were encountered first. $2 refers to "(black|blue)", which was encountered next. And $3 refers to "(Mustang|Camaro)", which was encountered last.
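The replacement examples can be reproduced in a short sketch. It rests on an assumption: Python's re module is used in place of PCRE2, and Python writes group references as "\g<n>" rather than "$n", so the hypothetical helper to_python_template converts between the two forms. Neither helper function is part of any Gammadyne program.

```python
import re

def to_python_template(replacement: str) -> str:
    # Hypothetical helper: rewrite each "$n" as Python's "\g<n>" group reference.
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)

def search_and_replace(pattern: str, replacement: str, subject: str) -> str:
    # Replace every match of the pattern, expanding "$n" references.
    return re.sub(pattern, to_python_template(replacement), subject)

print(search_and_replace(r"red", "dark $0", "a red car"))
# a dark red car
print(search_and_replace(r"((black|blue) (Mustang|Camaro))", "$1,$2,$3",
                         "The blue Mustang is fast."))
# The blue Mustang,blue,Mustang is fast.
```

Only the matched text is replaced; the rest of the subject is left as it was.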