Regular Expressions in Dyalog APL
Regular expressions can be used in Dyalog APL through .NET. The following code was prepared with Version 11 of Dyalog.
Introduction
Note that this article is not about regular expressions as such: instead the reader is assumed to be familiar with regexes, their syntax, groupings, etc.
.NET regular expressions are based on those of Perl and are compatible with Perl 5 regular expressions. .NET contains a set of powerful classes that make it even easier to use regular expressions.
The following is a list of classes in the System.Text.RegularExpressions namespace:
- Capture
- Represents the results from a single subexpression capture. Capture represents one substring for a single successful capture.
- CaptureCollection
- Represents a sequence of capture substrings. CaptureCollection returns the set of captures done by a single capturing group.
- Group
- Group represents the results from a single capturing group. A capturing group can capture zero, one, or more strings in a single match because of quantifiers, so Group supplies a collection of Capture objects.
- GroupCollection
- Represents a collection of captured groups. GroupCollection returns the set of captured groups in a single match.
- Match
- Represents the results from a single regular expression match.
- MatchCollection
- Represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.
- Regex
- Represents an immutable (read only) regular expression.
Examples
Here are a few examples of how to use them.
      ⎕USING←'System.Text.RegularExpressions,system.dll' ⍝ This is where the Regex class resides
      ⎕wx ⎕io←3 0

      ⍝ There are 2 matching functions: <match> and <matches>
      ⍝ Let's start with the <Matches> function:
      ⍝ This function deals with all the matches, regardless of grouping:
      m←Regex.Matches 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Count ⍝ only 2 matches, not 4
2
      m[0 1].Index ⍝ they start at offset 2 and 6 (4 overlaps)
2 6
      (⌷m).Index ⍝ more succinctly
2 6

      ⍝ Another example
      text←'"tit for tat" said that fat and tall top cat'
      p1←'[ct].{0,3}[pt]' ⍝ find 'c' or 't' followed by 0 to 3 characters then by 'p' or 't'
      m←Regex.Matches text p1
      m.Count ⍝ 5 found
5
      ⌷m ⍝ these are objects, not strings
tit tat that top cat
      DISPLAY ⍕¨⌷m
┌───┬───┬────┬───┬───┐
│tit│tat│that│top│cat│
└───┴───┴────┴───┴───┘
      (⌷m).Index
1 9 19 37 41
      (⌷m).Length
3 3 4 3 3

      ⍝ Let's see the <Match> function:
      m←Regex.Match 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Success
1
      m.Index
2
      m←m.NextMatch
      m.Success
1
      m.Index
6
      m←m.NextMatch
      m.Success
0

      ⍝ Let's capture groups with the <Match> function.
      ⍝ We're looking for names that have 4 sections separated by _ like a_b_c_d
      text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
      ⍝                      [  this word  ]
      pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
      ⍝          [ group 1  ]   [ group 2 ] - group 0 is the entire match
      +m←Regex.Match text pattern
kl_bnm_iop_good
      m.(Index Length)
17 15
      m.Groups.Count ⍝ groups 0 (all), 1 & 2.
3
      m.Groups[2]
good
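The class list above says that a Group holds a collection of Capture objects because a quantified group can capture more than once. The following is only a sketch of that point, reusing the text and pattern from the last example; it relies on the standard .NET members Group.Captures and Capture.Value, and the variable name g is introduced here for illustration.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
⎕IO←0
text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
m←Regex.Match text pattern
g←m.Groups[1]            ⍝ the quantified group ([a-z0-9]+_){3}
g.Value                  ⍝ a Group reports only its last capture, here 'iop_'
g.Captures.Count         ⍝ one Capture per repetition of the group, here 3
(⌷g.Captures).Value      ⍝ every capture in order: 'kl_' 'bnm_' 'iop_'
```

Group 2 is not quantified, so its Captures collection would hold a single item, the same string as its Value.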
Warning!
Some characters are treated in a special way in Dyalog. In particular, the caret, used in regexes, appears twice in ⎕AV and care must be taken to use the right one. The Classic version of Dyalog does not offer a way to enter both characters distinctly from the keyboard.
In a regular expression, apart from meaning "a caret", the caret can mean two things:
- at the beginning of a pattern it anchors the match to the beginning of the text
- as the first character inside [sets] it means negate
The caret used in regexes for that purpose is found at ⎕AV[235] (⎕IO 0). The one used for the APL function AND is found at ⎕AV[167]. Thus to look for a line starting with ABC and not followed by D or E you would use the pattern '^ABC[^DE]' which would be constructed as
⎕AV[235],'ABC[',⎕AV[235],'DE]'
In the Unicode version the 2 characters are distinct and they can be entered directly from the keyboard.
Options
.NET allows the behaviour of a search to be modified through options. The main options are CultureInvariant, IgnoreCase, IgnorePatternWhitespace, Multiline and Singleline.
To use an option, take it from the RegexOptions enumeration, as in RegexOptions.Multiline. To use several options simply add them up: RegexOptions.(Multiline+IgnoreCase).
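As an illustration of combining options, here is a minimal sketch that passes the sum of two options as the third argument of Matches. It assumes a Unicode edition of Dyalog, so that ⎕UCS is available and the caret can be typed directly; the names nl and opts and the sample text are made up for this example.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
nl←⎕UCS 10                          ⍝ line feed; ⎕UCS assumes a Unicode edition of Dyalog
text←'one',nl,'TWO',nl,'three'
opts←RegexOptions.(Multiline+IgnoreCase)
m←Regex.Matches text '^t\w+$' opts  ⍝ ^ and $ now anchor at every line; t also matches T
m.Count                             ⍝ 2: the lines TWO and three
(⌷m).Value
```

Without Multiline the anchors would apply to the whole text only, and without IgnoreCase the line TWO would not qualify.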
Other examples
Looking for an IP address (4 numbers of up to 3 digits separated by dots):
      text←'Dan, 192.168.1.2, foo-foo'
      z←Regex.Matches text '(\d{1,3}\.){3}\d{1,3}'
      z[0]
192.168.1.2
Looking for H1 text:
      text←'<h1>APL is great!</h1>'
      pattern←'<h1>(.*?)</h1>' ⍝ group 1 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[1]
APL is great!
Looking for text between any tag:
      text←'<p>APL is powerful!</p>'
      pattern←'<(\w+)>(.*?)</\1>' ⍝ group 2 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[2]
APL is powerful!
Looking for an APL identifier (including system names):
      ∇ test;local
        global←1 ⋄ local←2
       label:⎕IO←1
        :If 1 ⍝ ⍺
            ∆special∆←1 ⋄ special⍙←2
            _special←3 ⋄ Áspecial←4
        :EndIf
      ∇
      ok←0≤⎕NC 256 1⍴⎕AV ⍝ find all name forming characters: ∆, ⍙, Á, etc.
      r←'a-zA-Z',(ok/⎕AV)~,⎕AV[(⎕AV⍳'Aa')∘.-1-⍳26]
      ⍝ The pattern is any of those characters, followed by 0 or more of the same characters plus digits
      ⍝ and not preceded by : (for those :statements). No accounting for quotes or comments here.
      pattern←'((?<!\s:)(?<![',r,'])⎕?[',r,'][',r,'0-9]*|⍺⍺|⍺|⍵⍵|⍵)' ⍝ include alpha and omega
      ⌷m←Regex.Matches (⎕VR'test') pattern ⍝ find all names in the function <test>
test local global local label ⎕IO ⍺ ∆special∆ special⍙ _special Áspecial
You can use named groups instead of numbered groups (the default):
      pattern←'<h1>(?<STR>.*?)</h1>' ⍝ group 'STR' is to contain the text in between
      (Regex.Match 'aaaaz<h1>Title</H1>sad ' pattern RegexOptions.IgnoreCase).Groups[⊂'STR']
Title
The <IsMatch> function returns 1 if the pattern is found ANYWHERE.
Example: Validate password conditions such as: "Password must be from 8 to 20 characters, must contain at least 2 letters and at least 2 digits. It can only contain letters and digits."
      p←⎕new Regex,⊂⊂⎕AV[235],'(?=.*?\d.*?\d)(?=(.*?[a-zA-Z]){2,})[\da-zA-Z]{8,20}$'
      p.IsMatch∘⊂¨'ds' ' 32a ' '0123456789x' '01234abcde56789wxyzp1' '0123aAzZ'
0 0 0 0 1
You can split a string into substrings using the <Split> function. This is roughly the complement of <Matches>: it returns not the matches but everything between them.
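Because IsMatch succeeds when the pattern is found anywhere, a validation pattern such as the one above has to be anchored at both ends. A small sketch of the difference, using the static form of IsMatch and assuming the caret can be typed directly (Unicode edition); the sample strings are made up:

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
Regex.IsMatch 'abc123def' '\d+'     ⍝ 1: digits occur somewhere in the text
Regex.IsMatch 'abc123def' '^\d+$'   ⍝ 0: the whole text is not made of digits
Regex.IsMatch '123' '^\d+$'         ⍝ 1: the anchored pattern covers the whole text
```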
Example: split where X is followed by a digit:
      ⌷m←Regex.Split '1stX1aaaX2bbbX3cccX4ddd' 'X\d'
1st aaa bbb ccc ddd
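One detail of the .NET Split worth knowing: if the pattern contains capturing parentheses, the captured text is included in the result alongside the pieces. The sketch below is a variation on the example above, with a shorter made-up input and the hypothetical name parts:

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
parts←⌷Regex.Split '1stX1aaaX2bbb' 'X(\d)'   ⍝ the digit is now a capturing group
⍴parts                                        ⍝ 5 items rather than 3
parts                                         ⍝ '1st' '1' 'aaa' '2' 'bbb'
```

This can be handy when part of the separator carries information that must be kept.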
Using regular expressions to replace strings
You specify the pattern, the text, and a replacement string in which $n denotes group 'n'.
Example 1
Change "Surname, Name" into "Name Surname" (and account for spaces):
      pat←⎕new Regex (⊂'\s*(\w+)\s*,\s*(\w+)\s*') ⍝ group 1 is surname, group 2 is name
      pat.Replace ' Iverson, Ken ' '$2 $1'
Ken Iverson
      ⍝ or, for a one time event:
      Regex.Replace' Iverson, Ken ' '\s*(\w+)\s*,\s*(\w+)\s*' '$2 $1'
You can also use named groups instead of numbers.
      Regex.Replace' Iverson, Ken ' '\s*(?<Last>\w+)\s*,\s*(?<First>\w+)\s*' '${First} ${Last}'
Ken Iverson
Example 2
If the replacement needs special treatment, you can supply your own function to perform it through a MatchEvaluator. You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs. For example, if you want example 1 to ensure that only the first letter of each name is capitalized, you can write:
      ∇ str←cap arg
        ⍝ ToUpper and ToLower are assumed to be case-conversion utilities defined elsewhere
        str←(ToUpper arg.Groups[3].Value),ToLower arg.Groups[4].Value
        str,←' ',(ToUpper arg.Groups[1].Value),ToLower arg.Groups[2].Value
      ∇
      capor←⎕NEW MatchEvaluator (⎕or'cap')
      pat←⎕NEW Regex,⊂⊂'\s*(\w)(\w*)\s*,\s*(\w)(\w*)\s*' ⍝ groups 1 & 3 are the first letters of the names
      pat.Replace ' iVErson, kEn ' capor
Ken Iverson
Author: DanBaronet