Regular expressions can be used in APL through .Net

Regular Expressions

Introduction

Note that this article is not about regular expressions as such: instead the reader is assumed to be familiar with regexes, their syntax, groupings, etc.

.Net regular expressions are based on those of Perl and are designed to be compatible with Perl 5 regular expressions. .Net also provides a set of powerful classes that make regular expressions even easier to use.

The following classes are found in the System.Text.RegularExpressions namespace:

Capture
Represents the results from a single subexpression capture. Capture represents one substring for a single successful capture.

CaptureCollection
Represents a sequence of capture substrings. CaptureCollection returns the set of captures done by a single capturing group.

Group
Represents the results from a single capturing group. A capturing group can capture zero, one, or more strings in a single match because of quantifiers, so Group supplies a collection of Capture objects.

GroupCollection
Represents a collection of captured groups. GroupCollection returns the set of captured groups in a single match.

Match
Represents the results from a single regular expression match.

MatchCollection
Represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.

Regex
Represents an immutable (read-only) regular expression.

Examples

Here are a few examples of how to use them.

      ⎕USING←'System.Text.RegularExpressions,system.dll' ⍝ This is where the Regex class resides
      ⎕wx ⎕io←3 1

      ⍝ There are 2 matching functions: <Match> and <Matches>
      ⍝ Let's start with the <Matches> function:
      ⍝ This function deals with all the matches, regardless of grouping:

      m←Regex.Matches 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Count       ⍝ only 2 matches, not 4
2
      m[0 1].Index  ⍝ they start at offsets 2 and 6 (a match at 4 would overlap the first)
2 6
      (⌷m).Index    ⍝ more succinctly
2 6
      ⍝ Another example
      text←'"tit for tat" said that fat and tall top cat'
      p1←'[ct].{0,3}[pt]' ⍝ find 'c' or 't' followed by 0 to 3 characters then by 'p' or 't'
      m←Regex.Matches text p1
      m.Count ⍝ 5 found
5
      ⌷m     ⍝ these are objects, not strings
 tit  tat  that  top  cat 
      DISPLAY ⍕¨⌷m
┌───┬───┬────┬───┬───┐
│tit│tat│that│top│cat│
└───┴───┴────┴───┴───┘
      (⌷m).Index  
1 9 19 37 41
      (⌷m).Length
3 3 4 3 3

      ⍝ Let's see the <Match> function:
      m←Regex.Match 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Success
1
      m.Index
2
      m←m.NextMatch
      m.Success
1
      m.Index
6
      m←m.NextMatch
      m.Success
0
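      ⍝ A possible sketch (not part of the original session): the same walk
      ⍝ written as a loop that collects the offset of every match.
      ⍝ <AllIndices> is a hypothetical name; ⎕USING is assumed to be set as above.
     ∇ r←text AllIndices pattern;m
       r←⍬                          ⍝ offsets found so far
       m←Regex.Match text pattern   ⍝ first match, if any
       :While m.Success
           r,←m.Index               ⍝ record where this match starts
           m←m.NextMatch            ⍝ move on to the next non-overlapping match
       :EndWhile
     ∇
      'xxababababa' AllIndices 'aba' ⍝ should give the same offsets as <Matches> did: 2 6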
      ⍝ Let's capture groups with the <Match> function.
      ⍝ We're looking for names that have 4 sections separated by _ like a_b_c_d
      text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
      ⍝                      [  this word  ]
      pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
      ⍝          [ group  1 ]   [ group 2 ] - group 0 is the entire match
      +m←Regex.Match text pattern
kl_bnm_iop_good
      m.(Index Length)
17 15
      m.Groups.Count  ⍝ groups 0 (all), 1 & 2. 
3
      m.Groups[2]
good
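
The Capture and CaptureCollection classes come into play when a group is quantified, as group 1 is in the pattern above. The lines below are a sketch rather than a transcript of a real session. They assume that m still holds the match found above and that a CaptureCollection can be enumerated with ⌷ in the same way as a MatchCollection:

      g←m.Groups[1]          ⍝ group 1 was repeated 3 times by {3}
      g.Value                ⍝ a group reports only its last capture: iop_
      g.Captures.Count       ⍝ every repetition is kept as a Capture: 3
      (⌷g.Captures).Value    ⍝ the captured substrings: kl_ bnm_ iop_
      (⌷g.Captures).Index    ⍝ and where each one starts: 17 20 24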

Warning!

Some characters are treated in a special way in Dyalog. In particular, the caret, used in regexes, appears twice in ⎕AV and care must be taken to use the right one. In a regular expression the caret can mean two things: at the start of a pattern it anchors the match to the beginning of the string (or of a line, with the Multiline option), and as the first character inside square brackets it negates the character class.

The caret used in regexes for both purposes is found at ⎕AV[235] (⎕IO 0). The one used for the APL function AND is found at ⎕AV[167]. Thus to look for a line starting with ABC and not followed by D or E you would use the pattern '^ABC[^DE]', which would be constructed as

      ⎕AV[235],'ABC[',⎕AV[235],'DE]'  ⍝ with ⎕IO←0; use ⎕AV[236] when ⎕IO is 1

Options

.Net provides options that change how a search is conducted. The main ones are CultureInvariant, IgnoreCase, IgnorePatternWhitespace, Multiline and Singleline.

To use an option, refer to it through the RegexOptions class, as in RegexOptions.Multiline. To combine several options simply add them up: RegexOptions.(Multiline+IgnoreCase)
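
As an illustration of how options change a result, the sketch below searches a two-line string for 'abc' at the end of a line. The string and the names nl, lines and opts are invented for the example, ⎕USING is assumed to be set as in the examples above, and ⎕UCS is assumed to be available to build the line feed:

      nl←⎕UCS 10                          ⍝ line feed character
      lines←'abc',nl,'ABC'                ⍝ two lines differing only in case
      (Regex.Matches lines 'abc$').Count  ⍝ 0: $ matches only at the very end, where the case differs
      (Regex.Matches lines 'abc$' RegexOptions.IgnoreCase).Count  ⍝ 1: 'ABC' at the end now matches
      opts←RegexOptions.(Multiline+IgnoreCase)
      (Regex.Matches lines 'abc$' opts).Count  ⍝ 2: with Multiline, $ also matches at the end of each line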

Other examples

Looking for an IP address (4 numbers of up to 3 digits, separated by dots):

      text←'Dan, 192.168.1.2, foo-foo'
      z←Regex.Matches text '(\d{1,3}\.){3}\d{1,3}'
      z[0]
192.168.1.2      

Looking for H1 text:

      text←'<h1>APL is powerful!</h1>'
      pattern←'<h1>(.*?)</h1>'     ⍝ group 1 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[1]
APL is powerful!

Looking for text between any tag:

      text←'<p>APL is powerful!</p>'
      pattern←'<(\w+)>(.*?)</\1>'  ⍝ group 2 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[2]
APL is powerful!

Looking for an APL identifier (including system names):

     ∇ test;local
[1]    global←1 ⋄ local←2
[2]   label:⎕IO←1
[3]    :If 1          ⍝ ⍺
[4]        ∆special∆←1 ⋄ special⍙←2
[5]        _special←3 ⋄ Áspecial←4
[6]    :EndIf
     ∇

      ok←0≤⎕NC 256 1⍴⎕AV  ⍝ find all name forming characters: ∆, ⍙, Á, etc.
      r←'a-zA-Z',(ok/⎕AV)~,⎕AV[(⎕AV⍳'Aa')∘.-1-⍳26]

     ⍝ The pattern is any of those characters, followed by 0 or more of the same characters plus digits
     ⍝ and not preceded by : (for those :statements). No accounting for quotes or comments here.
      pattern←'((?<!\s:)(?<![',r,'])⎕?[',r,'][',r,'0-9]*|⍺⍺|⍺|⍵⍵|⍵)' ⍝ include alpha and omega
      ⌷m←Regex.Matches (⎕VR'test') pattern  ⍝ find all names in the function <test>
 test  local  global  local  label  ⎕IO  ⍺  ∆special∆  special⍙  _special  Áspecial 

You can use named groups instead of numbered groups (the default):

      pattern←'<h1>(?<STR>.*?)</h1>'     ⍝ group 'STR' is to contain the text in between
      (Regex.Match 'aaaaz<h1>Title</H1>sad ' pattern RegexOptions.IgnoreCase).Groups[⊂'STR']
Title

The <IsMatch> function returns 1 if the pattern is found.

Example: Validate password conditions such as: "Password must be from 8 to 20 characters, must contain at least 2 letters and at least 2 digits. It can only contain letters and digits."

      p←⎕new Regex,⊂⊂⎕AV[236],'(?=.*?\d.*?\d)(?=(.*?[a-zA-Z]){2,})[\da-zA-Z]{8,20}$'
      p.IsMatch∘⊂¨'ds' ' 32a ' '0123456789x' '01234abcde56789wxyzp1'  '0123aAzZ'
0 0 0 0 1

You can split a string into substrings using the <Split> function. It is roughly the complement of <Matches>: it does not return the matches but everything in between them.

Example: split where X is followed by a digit:

      ⌷m←Regex.Split '1stX1aaaX2bbbX3cccX4ddd' 'X\d'
 1st  aaa  bbb  ccc  ddd 
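
To see the complement relationship, the same pattern can be passed to <Matches>, which returns exactly the separators that <Split> discarded. This is a sketch; it assumes that the Value property can be read across all the matches at once, in the same way as Index and Length in the earlier examples:

      (⌷Regex.Matches '1stX1aaaX2bbbX3cccX4ddd' 'X\d').Value  ⍝ the four separators: X1 X2 X3 X4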

Using regular expressions to replace strings

You specify the pattern, the text and the replacement string, in which $n denotes group 'n'.

Example 1

Change "Surname, Name" into "Name Surname" (and account for spaces):

      pat←⎕new Regex (⊂'\s*(\w+)\s*,\s*(\w+)\s*') ⍝ group 1 is surname, group 2 is name
      pat.Replace '  Iverson,   Ken ' '$2 $1'
Ken Iverson
      ⍝ or, for a one time event:
      Regex.Replace'  Iverson,   Ken ' '\s*(\w+)\s*,\s*(\w+)\s*' '$2 $1'

You can also use named groups instead of numbers.

      Regex.Replace'  Iverson,   Ken ' '\s*(?<Last>\w+)\s*,\s*(?<First>\w+)\s*' '${First} ${Last}'
Ken Iverson

Example 2

If you need special treatment you can use your own function to perform the replacement, by means of a MatchEvaluator. You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs. For example, if you want example 1 to ensure that only the first letter of each name is capitalized, you can write:

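      ⍝ <ToUpper> and <ToLower> are not defined in this article; they are assumed
      ⍝ to be case-conversion utility functions available in the workspace.
    ∇ str←cap arg
[1]   str←(ToUpper arg.Groups[3].Value),ToLower arg.Groups[4].Value      ⍝ first name
[2]   str,←' ',(ToUpper arg.Groups[1].Value),ToLower arg.Groups[2].Value ⍝ surname
    ∇
      capor←⎕NEW MatchEvaluator (⎕or'cap')
      pat←⎕NEW Regex,⊂⊂'\s*(\w)(\w*)\s*,\s*(\w)(\w*)\s*' ⍝ groups 1 & 3 are the first letters of the surname and first name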
      pat.Replace '  iVErson,   kEn ' capor
Ken Iverson

Author: Dan Baronet

