Regular Expressions in Dyalog APL

Regular expressions can be used in Dyalog APL through .NET. The following code was prepared with Version 11 of Dyalog.

Introduction

This article is not about regular expressions as such; the reader is assumed to be familiar with regexes, their syntax, groupings, etc.

.NET regular expressions are based on those of Perl and are compatible with Perl 5 regular expressions. .NET provides a set of powerful classes that make regular expressions even easier to use.

The following classes are found in the System.Text.RegularExpressions namespace:

Capture
Represents the results from a single subexpression capture. Capture represents one substring for a single successful capture.

CaptureCollection
Represents a sequence of capture substrings. CaptureCollection returns the set of captures done by a single capturing group.

Group
Represents the results from a single capturing group. A capturing group can capture zero, one, or more strings in a single match because of quantifiers, so Group supplies a collection of Capture objects.

GroupCollection
Represents a collection of captured groups. GroupCollection returns the set of captured groups in a single match.

Match
Represents the results from a single regular expression match.

MatchCollection
Represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.

Regex
Represents an immutable (read only) regular expression.

Examples

Here are a few examples of how to use them.

      ⎕USING←'System.Text.RegularExpressions,system.dll' ⍝ This is where the Regex class resides
      ⎕wx ⎕io←3 0

      ⍝ There are 2 matching functions: <Match> and <Matches>
      ⍝ Let's start with the <Matches> function:
      ⍝ This function deals with all the matches, regardless of grouping:

      m←Regex.Matches 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Count       ⍝ only 2 matches, not 4
2
      m[0 1].Index  ⍝ they start at offsets 2 and 6 (those at 4 and 8 overlap)
2 6
      (⌷m).Index    ⍝ more succinctly
2 6
      ⍝ Another example
      text←'"tit for tat" said that fat and tall top cat'
      p1←'[ct].{0,3}[pt]' ⍝ find 'c' or 't' followed by 0 to 3 characters then by 'p' or 't'
      m←Regex.Matches text p1
      m.Count ⍝ 5 found
5
      ⌷m     ⍝ these are objects, not strings
 tit  tat  that  top  cat 
      DISPLAY ⍕¨⌷m
┌───┬───┬────┬───┬───┐
│tit│tat│that│top│cat│
└───┴───┴────┴───┴───┘
      (⌷m).Index  
1 9 19 37 41
      (⌷m).Length
3 3 4 3 3

      ⍝ Let's see the <Match> function:
      m←Regex.Match 'xxababababa' 'aba' ⍝ returns the first match only; also non-overlapping
      m.Success
1
      m.Index
2
      m←m.NextMatch
      m.Success
1
      m.Index
6
      m←m.NextMatch
      m.Success
0
      ⍝ Let's capture groups with the <Match> function.
      ⍝ We're looking for names that have 4 sections separated by _ like a_b_c_d
      text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
      ⍝                                     [     this    word     ]
      pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
      ⍝          [  group 1 ]   [ group 2 ] - group 0 is the entire match
      +m←Regex.Match text pattern
kl_bnm_iop_good
      m.(Index Length)
17 15
      m.Groups.Count  ⍝ groups 0 (all), 1 & 2. 
3
      m.Groups[2]
good
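
The Group and Capture classes listed in the introduction can be explored with the same match. As a sketch (assuming m still holds the match found above, and that monadic ⌷ enumerates a CaptureCollection the same way it does a MatchCollection), group 1 was repeated 3 times by the {3} quantifier, so it holds 3 captures, while its own value is only the last of them:

      m.Groups[1].Captures.Count      ⍝ one capture per repetition of group 1
3
      (⌷m.Groups[1].Captures).Value   ⍝ every substring group 1 captured
 kl_  bnm_  iop_ 
      (⌷m.Groups[1].Captures).Index
17 20 24
      m.Groups[1].Value               ⍝ the group itself reports the last capture
iop_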

Warning!

Some characters are treated in a special way in Dyalog. In particular, the caret, used in regexes, appears twice in ⎕AV and care must be taken to use the right one. The Classic version of Dyalog does not offer a way to enter both characters distinctly from the keyboard.

In a regular expression, apart from meaning "a caret", the caret can mean two things: "the start of the text (or line)" when it appears at the beginning of a pattern, and "none of these characters" when it appears first inside square brackets.

The caret used in regexes for those purposes is found at ⎕AV[235] (⎕IO 0). The one used for the APL function AND is found at ⎕AV[167]. Thus, to look for a line starting with ABC and not followed by D or E, you would use the pattern '^ABC[^DE]', which would be constructed as

      ⎕AV[235],'ABC[',⎕AV[235],'DE]'

In the Unicode version the 2 characters are distinct and they can be entered directly from the keyboard.

Options

.NET lets you change the way a search is conducted through options. The main options are CultureInvariant, IgnoreCase, IgnorePatternWhitespace, Multiline and Singleline.

To use an option, refer to it through the RegexOptions class, as in RegexOptions.Multiline. To combine several options, simply add them up: RegexOptions.(Multiline+IgnoreCase)
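
As an illustration, the sketch below counts the lines starting with 'abc' in a two-line text, first with the default settings and then with two options combined. It assumes the Unicode edition (so that the caret can be typed directly) and an interpreter that provides ⎕UCS to build the linefeed; the names text and opts are chosen for this example only.

      text←'abc',(⎕UCS 10),'ABC'             ⍝ two lines separated by a linefeed
      (Regex.Matches text '^abc').Count       ⍝ by default ^ anchors at the start of the text only
1
      opts←RegexOptions.(Multiline+IgnoreCase)
      (Regex.Matches text '^abc' opts).Count  ⍝ ^ now anchors at every line, in either case
2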

Other examples

Looking for an IP address (4 numbers of up to 3 digits separated by dots):

      text←'Dan, 192.168.1.2, foo-foo'
      z←Regex.Matches text '(\d{1,3}\.){3}\d{1,3}'
      z[0]
192.168.1.2      

Looking for H1 text:

      text←'<h1>APL is great!</h1>'
      pattern←'<h1>(.*?)</h1>'     ⍝ group 1 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[1]
APL is great!

Looking for text between any tag:

      text←'<p>APL is powerful!</p>'
      pattern←'<(\w+)>(.*?)</\1>'  ⍝ group 2 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[2]
APL is powerful!

Looking for an APL identifier (including system names):

     ∇ test;local
[1]    global←1 ⋄ local←2
[2]   label:⎕IO←1
[3]    :If 1          ⍝ ⍺
[4]        ∆special∆←1 ⋄ special⍙←2
[5]        _special←3 ⋄ Áspecial←4
[6]    :EndIf
     ∇

      ok←0≤⎕NC 256 1⍴⎕AV  ⍝ find all name forming characters: ∆, ⍙, Á, etc.
      r←'a-zA-Z',(ok/⎕AV)~,⎕AV[(⎕AV⍳'Aa')∘.-1-⍳26]

      ⍝ The pattern is any of those characters, followed by 0 or more of the same characters plus digits
      ⍝ and not preceded by : (for those :statements). No accounting for quotes or comments here.
      pattern←'((?<!\s:)(?<![',r,'])⎕?[',r,'][',r,'0-9]*|⍺⍺|⍺|⍵⍵|⍵)' ⍝ include alpha and omega
      ⌷m←Regex.Matches (⎕VR'test') pattern  ⍝ find all names in the function <test>
 test  local  global  local  label  ⎕IO  ⍺  ∆special∆  special⍙  _special  Áspecial 

You can use named groups instead of numbered groups (the default):

      pattern←'<h1>(?<STR>.*?)</h1>'     ⍝ group 'STR' is to contain the text in between
      (Regex.Match 'aaaaz<h1>Title</H1>sad ' pattern RegexOptions.IgnoreCase).Groups[⊂'STR']
Title

The <IsMatch> function returns 1 if the pattern is found ANYWHERE.

Example: Validate password conditions such as: "Password must be from 8 to 20 characters, must contain at least 2 letters and at least 2 digits. It can only contain letters and digits."

      p←⎕new Regex,⊂⊂⎕AV[235],'(?=.*?\d.*?\d)(?=(.*?[a-zA-Z]){2,})[\da-zA-Z]{8,20}$'
      p.IsMatch∘⊂¨'ds' ' 32a ' '0123456789x' '01234abcde56789wxyzp1'  '0123aAzZ'
0 0 0 0 1
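
The word ANYWHERE matters: unless the pattern is anchored, as the password pattern above is, a match in the middle of the text is enough. A minimal sketch using the static form of IsMatch (the caret is typed directly here, which assumes the Unicode edition):

      Regex.IsMatch 'xxabcxx' 'abc'     ⍝ found in the middle of the text
1
      Regex.IsMatch 'xxabcxx' '^abc$'   ⍝ anchored at both ends: the whole text must be abc
0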

You can split a string into substrings using the <Split> function. This is something like the complement of <Matches>: it returns not the matches but everything else.

Example: split where X is followed by a digit:

      ⌷m←Regex.Split '1stX1aaaX2bbbX3cccX4ddd' 'X\d'
 1st  aaa  bbb  ccc  ddd 
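
If the separators are needed as well, .NET's Split also returns whatever the pattern's capturing groups matched, so putting the pattern in parentheses keeps them in the result. A small sketch along the lines of the example above:

      ⌷Regex.Split '1stX1aaaX2bbb' '(X\d)'
 1st  X1  aaa  X2  bbb 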

Using regular expressions to replace strings

You specify the pattern, the text and the replacement string, using $n in the replacement to denote group 'n'.

Example 1

Change "Surname, Name" into "Name Surname" (and account for spaces):

      pat←⎕new Regex (⊂'\s*(\w+)\s*,\s*(\w+)\s*') ⍝ group 1 is surname, group 2 is name
      pat.Replace '  Iverson,   Ken ' '$2 $1'
Ken Iverson
      ⍝ or, for a one-off replacement:
      Regex.Replace'  Iverson,   Ken ' '\s*(\w+)\s*,\s*(\w+)\s*' '$2 $1'

You can also use named groups instead of numbers.

      Regex.Replace'  Iverson,   Ken ' '\s*(?<Last>\w+)\s*,\s*(?<First>\w+)\s*' '${First} ${Last}'
Ken Iverson

Example 2

If the replacement needs special treatment, you can use your own function to perform it through a MatchEvaluator. You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs. For example, if you want example 1 to ensure that only the first letter of each name is capitalized, you can write

      ⍝ ToUpper and ToLower are assumed to be case-conversion utilities already defined in the workspace
    ∇ str←cap arg
[1]   str←(ToUpper arg.Groups[3].Value),ToLower arg.Groups[4].Value
[2]   str,←' ',(ToUpper arg.Groups[1].Value),ToLower arg.Groups[2].Value
    ∇
      capor←⎕NEW MatchEvaluator (⎕or'cap')
      pat←⎕NEW Regex,⊂⊂'\s*(\w)(\w*)\s*,\s*(\w)(\w*)\s*' ⍝ groups 1 & 3 hold the first letter of each name
      pat.Replace '  iVErson,   kEn ' capor
Ken Iverson

Author: DanBaronet



RegularExpressionsWithDyalogAndDotNet (last edited 2015-04-05 01:25:27 by PierreGilbert)