<<TableOfContents>>

= Regular Expressions in Dyalog APL =

Dyalog APL can use regular expressions through .Net. The code below was prepared with V11 of Dyalog.

== Introduction ==

Note that this article is not about regular expressions as such: instead the reader is assumed to be familiar with regexes, their syntax, groupings, etc.

.Net regular expressions are based on those of Perl and are compatible with Perl 5 regular expressions. .Net also provides a set of powerful classes that make regular expressions even easier to use.

The following classes are found in the {{{System.Text.RegularExpressions}}} namespace:

 Capture:: Represents the results from a single subexpression capture. Capture represents one substring for a single successful capture.
 
 CaptureCollection:: Represents a sequence of capture substrings. !CaptureCollection returns the set of captures done by a single capturing group.
 
 Group:: Group represents the results from a single capturing group. A capturing group can capture zero, one, or more strings in a single match because of quantifiers, so Group supplies a collection of Capture objects.
 
 GroupCollection:: Represents a collection of captured groups. !GroupCollection returns the set of captured groups in a single match.
 
 Match:: Represents the results from a single regular expression match.
 
 MatchCollection:: Represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.
 
 Regex:: Represents an immutable (read only) regular expression.

== Examples ==
 
Here are a few examples of how to use them.

{{{
      ⎕USING←'System.Text.RegularExpressions,system.dll' ⍝ This is where the Regex class resides
      ⎕wx ⎕io←3 0

      ⍝ There are 2 matching functions: <Match> and <Matches>
      ⍝ Let's start with the <Matches> function:
      ⍝ This function deals with all the matches, regardless of grouping:

      m←Regex.Matches 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Count       ⍝ only 2 matches, not 4
2
      m[0 1].Index  ⍝ they start at offsets 2 and 6 (the occurrences at 4 and 8 overlap those)
2 6
      (⌷m).Index    ⍝ more succinctly
2 6
      ⍝ Another example
      text←'"tit for tat" said that fat and tall top cat'
      p1←'[ct].{0,3}[pt]' ⍝ find 'c' or 't' followed by 0 to 3 characters then by 'p' or 't'
      m←Regex.Matches text p1
      m.Count ⍝ 5 found
5
      ⌷m     ⍝ these are objects, not strings
 tit  tat  that  top  cat 
      DISPLAY ⍕¨⌷m
┌───┬───┬────┬───┬───┐
│tit│tat│that│top│cat│
└───┴───┴────┴───┴───┘
      (⌷m).Index  
1 9 19 37 41
      (⌷m).Length
3 3 4 3 3

      ⍝ Let's see the <Match> function:
      m←Regex.Match 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Success
1
      m.Index
2
      m←m.NextMatch
      m.Success
1
      m.Index
6
      m←m.NextMatch
      m.Success
0
      ⍝ Let's capture groups with the <Match> function.
      ⍝ We're looking for names that have 4 sections separated by _ like a_b_c_d
      text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
      ⍝                                     [     this    word     ]
      pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
      ⍝          [  group 1 ]   [ group 2 ] - group 0 is the entire match
      +m←Regex.Match text pattern
kl_bnm_iop_good
      m.(Index Length)
17 15
      m.Groups.Count  ⍝ groups 0 (all), 1 & 2. 
3
      m.Groups[2]
good
}}}
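
The difference between a Group and its Captures, described in the class list above, only shows when a group sits under a quantifier. The following is a minimal sketch rather than part of the original session: the sample text and the names m and caps are made up for illustration, and the same ⎕USING, ⎕WX and ⎕IO settings as above are assumed. The group reports its last capture, while its !CaptureCollection keeps every repetition.

{{{
      ⎕USING←'System.Text.RegularExpressions,system.dll'
      ⎕WX←3 ⋄ ⎕IO←0
      m←Regex.Match 'a_b_c_d' '([a-z]+_){3}([a-z]+)'
      m.Groups[1].Value          ⍝ a quantified group reports its last capture only
c_
      caps←m.Groups[1].Captures  ⍝ but every repetition is kept as a Capture
      caps.Count
3
      (⌷caps).Value
 a_  b_  c_ 
}}}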

== Warning! ==

Some characters are treated in a special way in Dyalog. In particular, the caret, used in regexes, appears twice in ⎕AV and care must be taken to use the right one. The Classic version of Dyalog does not offer a way to enter both characters distinctly from the keyboard.

In a regular expression, apart from standing for the caret character itself, the caret has two special meanings:

 * at the beginning of a pattern it anchors the match to the start of the text

 * as the first character inside a [set] it negates the set

The caret used in regexes for these purposes is found at ⎕AV[235] (⎕IO←0). The one used for the APL function AND is found at ⎕AV[167]. Thus, to look for text starting with ABC followed by any character other than D or E, you would use the pattern {{{'^ABC[^DE]'}}}, which would be constructed as

{{{
      ⎕AV[235],'ABC[',⎕AV[235],'DE]'
}}}

In the Unicode version the 2 characters are distinct and they can be entered directly from the keyboard.

== Options ==

Options change the way .Net conducts a search. The main ones are !CultureInvariant, !IgnoreCase, !IgnorePatternWhitespace, Multiline and Singleline.

Options are members of the !RegexOptions enumeration, as in !RegexOptions.Multiline. To use several options, simply add them up: !RegexOptions.(Multiline+!IgnoreCase).
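
As a rough illustration of passing options, the same search can be compared with and without them. The sample text is invented, the ⎕USING setting from the Examples section is assumed, and the options are assumed to sum as described above.

{{{
      ⎕USING←'System.Text.RegularExpressions,system.dll'
      text←'abc ABC Abc'
      (Regex.Matches text 'abc').Count                 ⍝ case sensitive by default
1
      (Regex.Matches text 'abc' RegexOptions.IgnoreCase).Count
3
      opts←RegexOptions.(IgnoreCase+IgnorePatternWhitespace)
      (Regex.Matches text 'a b c' opts).Count          ⍝ blanks in the pattern are ignored
3
}}}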

== Other examples ==

Looking for an IP address (4 numbers of up to 3 digits each, separated by dots):

{{{
      text←'Dan, 192.168.1.2, foo-foo'
      z←Regex.Matches text '(\d{1,3}\.){3}\d{1,3}'
      z[0]
192.168.1.2
}}}

Looking for H1 text:

{{{
      text←'<h1>APL is great!</h1>'
      pattern←'<h1>(.*?)</h1>'     ⍝ group 1 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[1]
APL is great!
}}}

Looking for text between '''any''' tag:

{{{
      text←'<p>APL is powerful!</p>'
      pattern←'<(\w+)>(.*?)</\1>'  ⍝ group 2 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[2]
APL is powerful!
}}}

Looking for an APL identifier (including system names):

{{{
     ∇ test;local
[1]    global←1 ⋄ local←2
[2]   label:⎕IO←1
[3]    :If 1          ⍝ ⍺
[4]        ∆special∆←1 ⋄ special⍙←2
[5]        _special←3 ⋄ Áspecial←4
[6]    :EndIf
     ∇

      ok←0≤⎕NC 256 1⍴⎕AV  ⍝ find all name forming characters: ∆, ⍙, Á, etc.
      r←'a-zA-Z',(ok/⎕AV)~,⎕AV[(⎕AV⍳'Aa')∘.-1-⍳26]

     ⍝ The pattern is any of those characters, followed by 0 or more of the same characters plus digits
     ⍝ and not preceded by : (for those :statements). No accounting for quotes or comments here.
      pattern←'((?<!\s:)(?<![',r,'])⎕?[',r,'][',r,'0-9]*|⍺⍺|⍺|⍵⍵|⍵)' ⍝ include alpha and omega
      ⌷m←Regex.Matches (⎕VR'test') pattern  ⍝ find all names in the function <test>
 test  local  global  local  label  ⎕IO  ⍺  ∆special∆  special⍙  _special  Áspecial 
}}}

You can use named groups instead of numbered groups (the default):

{{{
      pattern←'<h1>(?<STR>.*?)</h1>'     ⍝ group 'STR' is to contain the text in between
      (Regex.Match 'aaaaz<h1>Title</H1>sad ' pattern RegexOptions.IgnoreCase).Groups[⊂'STR']
Title
}}}

The <!IsMatch> function returns 1 if the pattern is found anywhere in the text, and 0 otherwise.

Example: Validate password conditions such as: "Password must be from 8 to 20 characters, must contain at least 2 letters and at least 2 digits. It can only contain letters and digits."

{{{
      p←⎕new Regex,⊂⊂⎕AV[235],'(?=.*?\d.*?\d)(?=(.*?[a-zA-Z]){2,})[\da-zA-Z]{8,20}$'
      p.IsMatch∘⊂¨'ds' ' 32a ' '0123456789x' '01234abcde56789wxyzp1'  '0123aAzZ'
0 0 0 0 1
}}}
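
Because <!IsMatch> succeeds when the pattern occurs anywhere, validating a whole string requires anchors. The following small sketch uses made-up inputs and the static form of the function (⎕USING is assumed to be set as in the Examples section). It relies on the .Net anchors \A (start of text) and \z (end of text), which avoid the caret altogether.

{{{
      ⎕USING←'System.Text.RegularExpressions,system.dll'
      Regex.IsMatch 'abc123' '\d+'       ⍝ digits are found somewhere in the text
1
      Regex.IsMatch 'abc123' '\A\d+\z'   ⍝ the whole text must consist of digits
0
      Regex.IsMatch '123' '\A\d+\z'
1
}}}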

You can split a string into substrings using the <Split> function. It is roughly the complement of <Matches>: it returns not the matches but everything between them.

Example: split where X is followed by a digit:

{{{
      ⌷m←Regex.Split '1stX1aaaX2bbbX3cccX4ddd' 'X\d'
 1st  aaa  bbb  ccc  ddd 
}}}
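
To see the complementary relationship, the same text and pattern can be passed to both functions. This is only a sketch reusing the example above; ⎕USING, ⎕WX and ⎕IO are assumed to be set as in the Examples section.

{{{
      ⎕USING←'System.Text.RegularExpressions,system.dll'
      ⎕WX←3 ⋄ ⎕IO←0
      text←'1stX1aaaX2bbbX3cccX4ddd'
      (⌷Regex.Matches text 'X\d').Value  ⍝ the separators themselves
 X1  X2  X3  X4 
      ⌷Regex.Split text 'X\d'            ⍝ everything between them
 1st  aaa  bbb  ccc  ddd 
}}}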

== Using regular expressions to replace strings ==

You specify the pattern, the text and the replacement string, in which $n denotes group 'n'.

=== Example 1 ===

Change "Surname, Name" into "Name Surname" (and account for spaces):

{{{
      pat←⎕new Regex (⊂'\s*(\w+)\s*,\s*(\w+)\s*') ⍝ group 1 is surname, group 2 is name
      pat.Replace '  Iverson,   Ken ' '$2 $1'
Ken Iverson
      ⍝ or, for a one-off replacement, use the static function:
      Regex.Replace'  Iverson,   Ken ' '\s*(\w+)\s*,\s*(\w+)\s*' '$2 $1'
Ken Iverson
}}}

You can also use named groups instead of numbers.

{{{
      Regex.Replace'  Iverson,   Ken ' '\s*(?<Last>\w+)\s*,\s*(?<First>\w+)\s*' '${First} ${Last}'
Ken Iverson
}}}
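
Besides numbered and named groups, the replacement string can refer to the whole match as $0. A short sketch with invented sample text (⎕USING is assumed to be set as in the Examples section):

{{{
      ⎕USING←'System.Text.RegularExpressions,system.dll'
      Regex.Replace 'a1b22c333' '\d+' '<$0>'    ⍝ wrap every run of digits in brackets
a<1>b<22>c<333>
      Regex.Replace 'a1b22c333' '(\d)\d*' '$1'  ⍝ keep only the first digit of each run
a1b2c3
}}}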

=== Example 2 ===

If the replacement needs special treatment, you can supply your own function to perform it through a !MatchEvaluator. You can think of a !MatchEvaluator as an event handler that fires when an "!OnMatch event" occurs. For example, to make Example 1 capitalize only the first letter of each name, you can write

{{{
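      ⍝ ToUpper and ToLower are assumed to be case-conversion functions defined elsewhere
    ∇ str←cap arg
[1]   str←(ToUpper arg.Groups[3].Value),ToLower arg.Groups[4].Value       ⍝ first name
[2]   str,←' ',(ToUpper arg.Groups[1].Value),ToLower arg.Groups[2].Value  ⍝ surname
    ∇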
      capor←⎕NEW MatchEvaluator (⎕or'cap')
      pat←⎕NEW Regex,⊂⊂'\s*(\w)(\w*)\s*,\s*(\w)(\w*)\s*' ⍝ groups 1 & 3 are the first letters of each name
      pat.Replace '  iVErson,   kEn ' capor
Ken Iverson
}}}

Author: DanBaronet

----
CategoryRegularExpressions - CategoryDotNet - CategoryDyalogDotNet - CategoryDyalogExamplesDotNet