Gammadyne Corporation

Regular Expressions

Various features in Gammadyne's programs support regular expressions.  This article describes their syntax, covering only the most commonly used capabilities.  For more detailed information, we recommend the PCRE2 documentation, which you can find by searching for "PCRE2".  PCRE2 is the open-source engine that Gammadyne uses to support regular expressions.  Its syntax is designed to be very close to the syntax that Perl supports.

Index:

Overview
Backslash
Quantifiers
Subpatterns
Character Classes
Character Properties
Assertions
Replacement Strings

Overview

When talking about regular expressions, the text that you are searching is referred to as the "subject".  A regular expression is a search pattern that matches a portion of the subject.  Regular expressions are particularly useful for matching repetitions and alternatives.

Most characters represent themselves in a pattern.  For example, the regular expression "foo" will match the text "foo" and nothing else.  It is in "metacharacters" that the true power of regular expressions is exposed:

Metacharacter  Description
.              Match any character except newline.
()             One or more metacharacters can be enclosed in parentheses to form a "subpattern".
\              The backslash is a general purpose escape character with multiple uses.
^              Assert start of string/line.
$              Assert end of string/line.
[]             Square brackets enclose a "character class".  This matches an entire family of characters.
?              Indicates that there can be 0 or 1 occurrences.  Has other uses as well.
*              Indicates 0 or more occurrences are allowed.
+              Indicates 1 or more occurrences are allowed.
{}             Follows a character, metacharacter, or subpattern to indicate the number of occurrences that are allowed.

Backslash

The backslash serves multiple purposes.  When it appears in front of a metacharacter, it takes away the metacharacter's special meaning.  For example, "\+" matches a literal "+" instead of acting as the "1 or more" quantifier.
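
The effect of escaping can be sketched in a few lines.  This example uses Python's standard `re` module as a stand-in, on the assumption that it treats "+" and "\+" the same way PCRE2 does; the variable names are illustrative only.

```python
import re

# Escaped: "\+" is a literal plus sign.
literal_plus = re.compile(r"1\+1")
# Unescaped: "+" is a quantifier, so this means one or more "1", then "1".
quantified = re.compile(r"1+1")

print(bool(literal_plus.fullmatch("1+1")))  # True
print(bool(literal_plus.fullmatch("111")))  # False
print(bool(quantified.fullmatch("111")))    # True
print(bool(quantified.fullmatch("1+1")))    # False
```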

The backslash is also useful for representing certain special characters:
Escape   Description
\f       Form feed (code 12)
\n       Line feed (code 10)
\r       Carriage return (code 13)
\t       Tab (code 9)
\o{##}   Octal code (the digits go inside curly braces)
\x##     Hex code

Furthermore, the backslash can represent various character types:
Escape  Description
\C      Any single code unit, which is one whole character outside of UTF mode.  Unlike the dot, this even matches newlines.
\d      Digit (0-9).
\D      Non digit.
\h      Horizontal whitespace (spaces, tabs, etc.).
\H      Non horizontal whitespace.
\R      Any Unicode newline.  Equivalent to "(?>\r\n|\n|\x0b|\f|\r|\x85)".
\s      Whitespace (newlines, spaces, tabs, etc.).
\S      Non whitespace.
\v      Vertical whitespace (linefeeds, carriage returns, etc.).
\V      Non vertical whitespace.
\w      Letters, digits, and the underscore (_).
\W      Everything but letters, digits, and the underscore.
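
A few of these character types can be tried out with the short sketch below.  It uses Python's standard `re` module rather than PCRE2, so it is limited to the escapes both engines share (\d, \w, \s, \S); Python's `re` does not support \h, \H, \R, or \C, and it treats \v as the vertical tab character rather than as a character type.  The subject text is made up for illustration.

```python
import re

subject = "Order 42\tshipped on 2024-01-15\n"

digits = re.findall(r"\d+", subject)      # runs of digits
words = re.findall(r"\w+", subject)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", subject)       # individual whitespace characters
non_spaces = re.findall(r"\S+", subject)  # runs of non-whitespace

print(digits)      # ['42', '2024', '01', '15']
print(words)       # ['Order', '42', 'shipped', 'on', '2024', '01', '15']
print(spaces)      # [' ', '\t', ' ', ' ', '\n']
print(non_spaces)  # ['Order', '42', 'shipped', 'on', '2024-01-15']
```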

Quantifiers

A quantifier is used to specify repetitions.  Any character, metacharacter, subpattern, or character class can be followed by a quantifier.

The asterisk (*) quantifier allows for 0 or more repetitions.  For example, "a*" matches "", "a", "aa", "aaa", etc.

The plus (+) quantifier allows for 1 or more repetitions.  For example, "(ab)+" matches "ab", "abab", "ababab", etc.

The question mark (?) quantifier allows for 0 or 1 repetitions.  For example, "c?" matches "" or "c".

The curly brace ({}) quantifier allows for a specific number of repetitions.  It can specify a range of repetitions, such as "{2,5}", or a lower bound with no limit, such as "{3,}".  For example, "d{3,4}" matches "ddd" or "dddd", and "e{4,}" matches "eeee", "eeeee", "eeeeee", etc.
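
The quantifier examples above can be checked with a small sketch.  It assumes Python's standard `re` module, whose handling of these four quantifiers matches the behavior described here; `matches` is a hypothetical helper that tests whether a pattern matches the entire subject.

```python
import re

def matches(pattern, subject):
    # Hypothetical helper: True only if the pattern matches the whole subject.
    return re.fullmatch(pattern, subject) is not None

print(matches(r"a*", ""))           # True
print(matches(r"(ab)+", "abab"))    # True
print(matches(r"(ab)+", ""))        # False
print(matches(r"c?", "cc"))         # False
print(matches(r"d{3,4}", "ddd"))    # True
print(matches(r"d{3,4}", "dd"))     # False
print(matches(r"e{4,}", "eeeeee"))  # True
```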

Subpatterns

Part of a regular expression can be enclosed in parentheses.  This is referred to as a "subpattern".  This allows a series of characters to be repeated using a quantifier.  For example, "(abc)+" matches "abc", "abcabc", "abcabcabc", etc.

Another important use for subpatterns is specifying alternatives using the vertical bar (|).  For example, "(abc|def)" matches either "abc" or "def".
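
Both uses of subpatterns can be illustrated with a brief sketch, again assuming Python's standard `re` module as a stand-in for PCRE2.  The third pattern, which drops the parentheses, is added here only for contrast.

```python
import re

repeated = re.compile(r"(abc)+")      # the whole group "abc" repeats
unparenthesized = re.compile(r"abc+") # only the "c" repeats
either = re.compile(r"(abc|def)")     # alternatives inside a subpattern

print(bool(repeated.fullmatch("abcabc")))         # True
print(bool(unparenthesized.fullmatch("abcabc")))  # False
print(bool(unparenthesized.fullmatch("abccc")))   # True
print(either.findall("abc xyz def"))              # ['abc', 'def']
```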

Character Classes

Square brackets ([]) are used to specify a "character class", which matches a single character that is a member of a set.

The simplest character class is a list of allowed characters.  For example, "[abc]" matches either "a", "b", or "c".

It is also possible to specify a range of characters using a dash.  For example, "[d-f]" matches either "d", "e", or "f".

If the first character is a circumflex (^), the character class is negated.  For example, "[^aeiou]" matches any character that is not a vowel.

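Character classes can include backslashed character types and properties.  Because a character class always matches exactly one character, "[\d\p{Lu}]" matches a single character that is either a digit or an upper case letter.  To match two digits followed by an upper case letter, use "\d\d\p{Lu}" instead.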
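
The following sketch shows that each character class matches one character at a time.  It assumes Python's standard `re` module; since that module does not support "\p{...}", the last line uses "\d" together with a literal underscore instead of a property.

```python
import re

listed = re.findall(r"[abc]", "cabbage")       # any of a, b, c
ranged = re.findall(r"[d-f]", "cabbage")       # any of d, e, f
negated = re.findall(r"[^aeiou]", "cabbage")   # anything that is not a vowel
with_type = re.findall(r"[\d_]", "id_42")      # a digit or an underscore

print(listed)     # ['c', 'a', 'b', 'b', 'a']
print(ranged)     # ['e']
print(negated)    # ['c', 'b', 'b', 'g']
print(with_type)  # ['_', '4', '2']
```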

Character Properties

Every character belongs to one or more general property classes.  You can match a character in the class with "\p{property}" where property is one of the following:
Property  Description
C         Other
Cc        Control
Cf        Format
Cn        Unassigned
Co        Private use
Cs        Surrogate (U+D800 to U+DFFF)
L         Letter
Ll        Lower case letter
Lm        Modifier letter
Lo        Other letter
Lt        Title case letter
Lu        Upper case letter
L&        Lower, upper, or title case letter
M         Mark
Mc        Spacing mark
Me        Enclosing mark
Mn        Non-spacing mark
N         Number
Nd        Decimal number
Nl        Letter number
No        Other number
P         Punctuation
Pc        Connector punctuation
Pd        Dash punctuation
Pe        Close punctuation
Pf        Final punctuation
Pi        Initial punctuation
Po        Other punctuation
Ps        Open punctuation
S         Symbol
Sc        Currency symbol
Sk        Modifier symbol
Sm        Mathematical symbol
So        Other symbol
Z         Separator
Zl        Line separator
Zp        Paragraph separator
Zs        Space separator

For example, "\p{Sm}" will match any mathematical symbol, such as "+" or "=".

You can negate a property test by using a capital P instead.  For example, "\P{N}" will match any non-number.
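
To see which property a given character has, the sketch below looks up its Unicode general category.  Python's standard `re` module does not support "\p{...}", so this uses the `unicodedata` module, which reports the same two-letter codes.  `has_property` is a hypothetical helper that only approximates "\p{property}"; it does not handle the special "L&" property.

```python
import unicodedata

def has_property(ch, prop):
    # Hypothetical stand-in for \p{prop}: compare against the character's
    # general category.  A one-letter property such as "N" covers every
    # category that begins with that letter (Nd, Nl, No).
    return unicodedata.category(ch).startswith(prop)

print(unicodedata.category("+"))  # Sm
print(has_property("=", "Sm"))    # True
print(has_property("7", "N"))     # True
print(has_property("x", "N"))     # False, so \P{N} would match "x"
```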

Assertions

An "assertion" specifies a condition that must be met, without consuming any characters from the subject.  The basic assertions are:
Assertion  Description
^          Start of string/line.
$          End of string/line.
\A         Start of the subject.
\b         At a word boundary.
\B         Not at a word boundary.
\z         End of the subject.
\Z         End of the subject, or before a newline at the end of the subject.
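
A short sketch can show where some of these assertions hold.  It assumes Python's standard `re` module, with its `re.MULTILINE` flag standing in for a multiline mode in which "^" and "$" apply to each line.  The "\z" and "\Z" assertions are left out because Python defines them differently from PCRE2.  `starts` is a hypothetical helper and the subject text is made up.

```python
import re

subject = "cat concat\ncatalog"

def starts(pattern, flags=0):
    # Hypothetical helper: offsets in `subject` where the pattern matches.
    return [m.start() for m in re.finditer(pattern, subject, flags)]

print(starts(r"\bcat\b"))              # [0]      "cat" as a whole word
print(starts(r"\Bcat"))                # [7]      "cat" inside "concat"
print(starts(r"^cat"))                 # [0]      start of the subject only
print(starts(r"^cat", re.MULTILINE))   # [0, 11]  start of every line
print(starts(r"\Acat", re.MULTILINE))  # [0]      \A ignores line breaks
print(starts(r"cat$", re.MULTILINE))   # [7]      end of the first line
```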

Replacement Strings

When performing a search and replace operation, it is possible for the replacement string to make references to the matched portion of the subject.  The entire match can be represented with "$0".  For example, if the match is "red", and the replacement string is "dark $0", then the match will be replaced with "dark red".

It is also possible to refer to a subpattern that was enclosed in parentheses.  Each subpattern is assigned an index, beginning at 1, in the order that its opening parenthesis appears from left to right.  For example, "$1" refers to the first subpattern, "$2" the second, and so on.

Consider the search string "((black|blue) (Mustang|Camaro))" and the replacement string "$1,$2,$3".  If the subject is "The blue Mustang is fast.", the match is "blue Mustang", and after the search and replace the subject would change to "The blue Mustang,blue,Mustang is fast.".  Only the matched portion is replaced; the rest of the subject is left alone.  $1 refers to the outermost parentheses, which were encountered first.  $2 refers to "(black|blue)" which was encountered next.  And $3 refers to "(Mustang|Camaro)" which was encountered last.
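
The subpattern numbering can be reproduced with the sketch below.  It uses Python's standard `re` module, which spells references differently: "$1" is written "\1" and "$0" is written "\g<0>".  That spelling is specific to Python and is not the syntax Gammadyne's programs use; the numbering of the subpatterns is the same.

```python
import re

pattern = r"((black|blue) (Mustang|Camaro))"
subject = "The blue Mustang is fast."

# Subpatterns are numbered by the position of their opening parenthesis.
groups = re.search(pattern, subject).groups()
print(groups)  # ('blue Mustang', 'blue', 'Mustang')

# Python's equivalent of the replacement string "$1,$2,$3".
result = re.sub(pattern, r"\1,\2,\3", subject)
print(result)  # The blue Mustang,blue,Mustang is fast.

# Python's equivalent of the replacement string "dark $0".
darker = re.sub(r"red", r"dark \g<0>", "a red door")
print(darker)  # a dark red door
```

Text outside the match ("The " and " is fast.") is carried over unchanged.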