Diff for "RegularExpressionsWithDyalogAndDotNet"

Differences between revisions 8 and 9

Regular expressions can be used in APL through .Net

.Net regular expressions are based on those of Perl and are compatible with Perl 5 regular expressions. .Net contains a set of powerful classes that make it even easier to use regular expressions.

The following is a list of the classes in the System.Text.RegularExpressions namespace:

Capture: Represents the results from a single subexpression capture. Capture represents one substring for a single successful capture.

CaptureCollection: Represents a sequence of capture substrings. CaptureCollection returns the set of captures done by a single capturing group.

Group: Group represents the results from a single capturing group. A capturing group can capture zero, one, or more strings in a single match because of quantifiers, so Group supplies a collection of Capture objects.

GroupCollection: Represents a collection of captured groups. GroupCollection returns the set of captured groups in a single match.

Match: Represents the results from a single regular expression match.

MatchCollection: Represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.

Regex: Represents an immutable (read only) regular expression.

Here are a few examples of how to use them. The reader is assumed to be familiar with regexes, their syntax, groupings, etc.

      ⎕USING←'System.Text.RegularExpressions,system.dll' ⍝ This is where the Regex class resides
      ⎕wx ⎕io←3 1

      ⍝ There are 2 matching functions: <match> and <matches>
      ⍝ Let's start with the <Matches> function:
      ⍝ This function deals with all the matches, regardless of grouping:

      m←Regex.Matches 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Count       ⍝ only 2 matches, not 4
2
      m[0 1].Index  ⍝ they start at offset 2 and 6 (4 overlaps)
2 6
      (⌷m).Index    ⍝ more succinctly
2 6
      ⍝ Another example
      text←'"tit for tat" said that fat and tall top cat'
      p1←'[ct].{0,3}[pt]' ⍝ find 'c' or 't' followed by 0 to 3 characters then by 'p' or 't'
      m←Regex.Matches text p1
      m.Count ⍝ 5 found
5
      ⌷m     ⍝ these are objects, not strings
 tit  tat  that  top  cat
      DISPLAY ⍕¨⌷m
┌───┬───┬────┬───┬───┐
│tit│tat│that│top│cat│
└───┴───┴────┴───┴───┘
      (⌷m).Index
1 9 19 37 41
      (⌷m).Length
3 3 4 3 3

      ⍝ Let's see the <Match> function:
      m←Regex.Match 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Success
1
      m.Index
2
      m←m.NextMatch
      m.Success
1
      m.Index
6
      m←m.NextMatch
      m.Success
0
      ⍝ Let's capture groups with the <Match> function.
      ⍝ We're looking for names that have 4 sections separated by _ like a_b_c_d
      text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
      ⍝                      [  this word  ]
      pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
      ⍝          [ group  1 ]   [ group 2 ] - group 0 is the entire match
      +m←Regex.Match text pattern
kl_bnm_iop_good
      m.(Index Length)
17 15
      m.Groups.Count  ⍝ groups 0 (all), 1 & 2.
3
      m.Groups[2]
good
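
The class list above says that a Group supplies a collection of Capture objects when a quantifier makes it capture more than once. The following is a minimal sketch of that, not taken from the original page. It assumes that m still holds the Match from the last example, and that ⌷ can be applied to a CaptureCollection in the same way as to a MatchCollection. The name g is only illustrative.

{{{
      g←m.Groups[1]          ⍝ group 1 is ([a-z0-9]+_), repeated by {3}
      g.Value                ⍝ the last substring captured by the group
      g.Captures.Count       ⍝ number of Capture objects held by the group
      (⌷g.Captures).Value    ⍝ assumption: ⌷ enumerates the captures, as it does for matches
}}}

Because group 1 is repeated three times in the pattern, Captures should hold one entry per repetition, while Value reports only the last of them.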

WARNING: some characters are treated in a special way in Dyalog. In particular, the caret, used in regexes, appears twice in ⎕AV and care must be taken to use the right one. For example, the caret can mean two things in a regular expression:

1. at the beginning of a pattern it means 'pattern starts at the beginning'

2. as the first character inside [sets] it means negate

The caret used in regexes for these purposes is found at ⎕AV[235] (⎕IO 0). The one used for the APL function AND is found at ⎕AV[167]. Thus, to look for a line starting with ABC and not followed by D or E, you would use the pattern '^ABC[^DE]', which would be constructed as

      ⎕AV[235],'ABC[',⎕AV[235],'DE]'
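
As an alternative sketch, assuming a Dyalog version that provides ⎕UCS (this system function is not mentioned in the original text), the regex caret can be obtained from its Unicode code point 94, which avoids depending on ⎕AV positions and on ⎕IO. The name caret is only illustrative.

{{{
      caret←⎕UCS 94                      ⍝ the circumflex used by regexes, not the APL AND symbol
      pattern←caret,'ABC[',caret,'DE]'   ⍝ intended to build the pattern ^ABC[^DE]
}}}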

Options

.Net allows some searches to be conducted in a different manner. The main options are CultureInvariant, IgnoreCase, IgnorePatternWhitespace, Multiline, Singleline.

To use options use the RegexOptions class as in RegexOptions.Multiline. To use several options simply add them up: RegexOptions.(Multiline+IgnoreCase)
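
A minimal sketch of combining two options, assuming ⎕USING has been set as in the first example. The name opts and the sample text are illustrative only. IgnorePatternWhitespace lets the pattern contain blanks for readability, and IgnoreCase makes the comparison case-insensitive.

{{{
      opts←RegexOptions.(IgnoreCase+IgnorePatternWhitespace)
      m←Regex.Matches 'abc ABC Abc' 'a b c' opts  ⍝ blanks in the pattern are ignored
      m.Count                                     ⍝ number of matches with both options in effect
}}}

The options are passed as a third argument, in the same position as RegexOptions.IgnoreCase in the examples that follow.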

Other examples.

Looking for an IP address: (4 numbers of up to 3 digits separated by dot)

      Regex.Matches text '(\d{1,3}\.){3}\d{1,3}' ⍝ A simple IP address finder

Looking for H1 text:

      pattern←'<h1>(.*?)</h1>'     ⍝ group 1 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[1]

Looking for text between ANY tag:

      pattern←'<(\w+)>(.*?)</\1>'  ⍝ group 2 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[2]

Looking for an APL identifier (including system names):

      ok←0≤⎕NC 256 1⍴⎕AV  ⍝ find all name forming characters: ∆, ⍙, Á, etc.
      r←'a-zA-Z',(ok/⎕AV)~,⎕AV[(⎕AV⍳'Aa')∘.+⍳26]
      ⍝ The pattern is any of those characters, followed by 0 or more of the same characters plus digits
      ⍝ and not preceded by : (for those :statements). No accounting for quotes or comments here.
      pattern←'(⎕?(?<!\s:)\b[',r,'][',r,'0-9]*|⍺⍺|⍺|⍵⍵|⍵)' ⍝ include alpha and omega
      ⌷m←Regex.Matches (⎕VR'test') pattern  ⍝ find all names in the function <test>
 test  ⎕IO  comment  label  cond  l10  Áqwe

You can use named groups instead of numbered groups (the default):

      pattern←'<h1>(?<STR>.*?)</h1>'     ⍝ group 'STR' is to contain the text in between
      (Regex.Match 'aaaaz<h1>Title</H1>sad ' pattern RegexOptions.IgnoreCase).Groups[⊂'STR']
Title

The <IsMatch> function returns 1 if the pattern is found.

Example: Validate password conditions such as: "Password must be from 8 to 20 characters, must contain at least 2 letters and at least 2 digits. It can only contain letters and digits."

      p←⎕new Regex,⊂⊂⎕AV[236],'(?=.*?\d.*?\d)(?=(.*?[a-zA-Z]){2,})[\da-zA-Z]{8,20}$'
      p.IsMatch∘⊂¨'ds' ' 32a ' '0123456789x' '01234abcde56789wxyzp1'  '0123aAzZ'
0 0 0 0 1

You can split a string into substrings using the <Split> function. This is roughly the complement of <Matches>: it returns not the matches but everything between them.

Example: split where X is followed by a digit:

      ⌷m←Regex.Split '1stX1aaaX2bbbX3cccX4ddd' 'X\d'
 1st  aaa  bbb  ccc  ddd

Using regular expressions to replace strings.

This is easily done. You specify the pattern, text and how to replace using $n to denote group 'n'.

Example 1

Change "Surname, Name" into "Name Surname" (and account for spaces):

      pat←⎕new Regex (⊂'\s*(\w+)\s*,\s*(\w+)\s*') ⍝ group 1 is surname, group 2 is name
      pat.Replace '  Iverson,   Ken ' '$2 $1'
Ken Iverson
      ⍝ or, for a one time event:
      Regex.Replace'  Iverson,   Ken ' '\s*(\w+)\s*,\s*(\w+)\s*' '$2 $1'

You can also use named groups instead of numbers.

      Regex.Replace'  Iverson,   Ken ' '\s*(?<Last>\w+)\s*,\s*(?<First>\w+)\s*' '${First} ${Last}'
Ken Iverson

Example 2

If you need special treatment, you can supply your own function to perform the replacement through a MatchEvaluator. You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs. For example, to make example 1 capitalize only the first letter of each name, you can write the following (ToUpper and ToLower are assumed to be case-conversion functions defined elsewhere in the workspace):

    ∇ str←cap arg
      str←(ToUpper arg.Groups[3].Value),ToLower arg.Groups[4].Value
      str,←' ',(ToUpper arg.Groups[1].Value),ToLower,arg.Groups[2].Value
    ∇
      capor←⎕NEW MatchEvaluator (⎕or'cap')
      pat←⎕NEW Regex,⊂⊂'\s*(\w)(\w*)\s*,\s*(\w)(\w*)\s*' ⍝ groups 1 & 3 are 1st name letters
      pat.Replace '  iVErson,   kEn ' capor
Ken Iverson

Author: Dan Baronet

CategoryDyalogExamplesDotNet

RegularExpressionsWithDyalogAndDotNet (last edited 2015-04-05 01:25:27 by PierreGilbert)

- Revision 8 as of 2007-06-25 08:50:42
  Size: 1757
  Editor: KaiJaeger
  Comment:
+ Revision 9 as of 2007-06-26 11:01:57
  Size: 8397
  Editor: anonymous
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 2:
-.Net comes with built-in regular expressions. Here some potentially useful functions are implemented: one checks whether an email address is valid, the other takes a URL and returns "{!ProtocolName}:{!PortNo}" as a result.
+Regular expressions can be used in APL through .Net
 Line 4:
-The examples are taken from the C# help file Visual Studio Express comes with. The changes needed to make them work under Dyalog APL/W are not a big deal.
+.Net regular expressions are based on those of Perl and are compatible with Perl 5 regular expressions.
.Net contains a set of powerful classes that make it even easier to use regular expressions.

The following is a list of the classes in the System.Text.RegularExpressions namespace:

Capture: Represents the results from a single subexpression capture. Capture represents one substring
for a single successful capture.
 
CaptureCollection: Represents a sequence of capture substrings. CaptureCollection returns the set of
captures done by a single capturing group.
 
Group: Group represents the results from a single capturing group. A capturing group can capture zero, one,
or more strings in a single match because of quantifiers, so Group supplies a collection of Capture objects.
 
GroupCollection: Represents a collection of captured groups. GroupCollection returns the set of captured
groups in a single match.
 
Match: Represents the results from a single regular expression match.
 
MatchCollection: Represents the set of successful matches found by iteratively applying a regular
expression pattern to the input string.
 
Regex: Represents an immutable (read only) regular expression.
 
Here are a few examples of how to use them. The reader is assumed to be familiar with regexes, their syntax,
groupings, etc.
{{{
      ⎕USING←'System.Text.RegularExpressions,system.dll' ⍝ This is where the Regex class resides
      ⎕wx ⎕io←3 1

      ⍝ There are 2 matching functions: <match> and <matches>
      ⍝ Let's start with the <Matches> function:
      ⍝ This function deals with all the matches, regardless of grouping:

      m←Regex.Matches 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Count       ⍝ only 2 matches, not 4
2
      m[0 1].Index  ⍝ they start at offset 2 and 6 (4 overlaps)
2 6
      (⌷m).Index    ⍝ more succinctly
2 6
      ⍝ Another example
      text←'"tit for tat" said that fat and tall top cat'
      p1←'[ct].{0,3}[pt]' ⍝ find 'c' or 't' followed by 0 to 3 characters then by 'p' or 't'
      m←Regex.Matches text p1
      m.Count ⍝ 5 found
5
      ⌷m     ⍝ these are objects, not strings
 tit  tat  that  top  cat 
      DISPLAY ⍕¨⌷m
┌───┬───┬────┬───┬───┐
│tit│tat│that│top│cat│
└───┴───┴────┴───┴───┘
      (⌷m).Index  
1 9 19 37 41
      (⌷m).Length
3 3 4 3 3

      ⍝ Let's see the <Match> function:
      m←Regex.Match 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Success
1
      m.Index
2
      m←m.NextMatch
      m.Success
1
      m.Index
6
      m←m.NextMatch
      m.Success
0
      ⍝ Let's capture groups with the <Match> function.
      ⍝ We're looking for names that have 4 sections separated by _ like a_b_c_d
      text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
      ⍝                      [  this word  ]
      pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
      ⍝          [ group  1 ]   [ group 2 ] - group 0 is the entire match
      +m←Regex.Match text pattern
kl_bnm_iop_good
      m.(Index Length)
17 15
      m.Groups.Count  ⍝ groups 0 (all), 1 & 2. 
3
      m.Groups[2]
good
}}}
WARNING: some characters are treated in a special way in Dyalog. In particular, the caret, used in
regexes, appears twice in ⎕AV and care must be taken to use the right one. For example, the caret can
mean two things in a regular expression:

1. at the beginning of a pattern it means 'pattern starts at the beginning'

2. as the first character inside [sets] it means negate

The caret used in regexes for these purposes is found at ⎕AV[235] (⎕IO 0). The one used for the APL
function AND is found at ⎕AV[167]. Thus, to look for a line starting with ABC and not followed by D or E,
you would use the pattern {{{'^ABC[^DE]'}}}, which would be constructed as
{{{
      ⎕AV[235],'ABC[',⎕AV[235],'DE]'
}}}
Options
.Net allows some searches to be conducted in a different manner. The main options are
CultureInvariant, IgnoreCase, IgnorePatternWhitespace, Multiline, Singleline

To use options use the RegexOptions class as in RegexOptions.Multiline. 
To use several options simply add them up: RegexOptions.(Multiline+IgnoreCase)

Other examples.

Looking for an IP address: (4 numbers of up to 3 digits separated by dot)
{{{
      Regex.Matches text '(\d{1,3}\.){3}\d{1,3}' ⍝ A simple IP address finder
}}}
Looking for H1 text:
{{{
      pattern←'<h1>(.*?)</h1>'     ⍝ group 1 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[1]
}}}
Looking for text between ANY tag:
{{{
      pattern←'<(\w+)>(.*?)</\1>'  ⍝ group 2 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[2]
}}}
Looking for an APL identifier (including system names):
{{{
      ok←0≤⎕NC 256 1⍴⎕AV  ⍝ find all name forming characters: ∆, ⍙, Á, etc.
      r←'a-zA-Z',(ok/⎕AV)~,⎕AV[(⎕AV⍳'Aa')∘.+⍳26]
     ⍝ The pattern is any of those characters, followed by 0 or more of the same characters plus digits
     ⍝ and not preceded by : (for those :statements). No accounting for quotes or comments here.
      pattern←'(⎕?(?<!\s:)\b[',r,'][',r,'0-9]*|⍺⍺|⍺|⍵⍵|⍵)' ⍝ include alpha and omega
      ⌷m←Regex.Matches (⎕VR'test') pattern  ⍝ find all names in the function <test>
 test  ⎕IO  comment  label  cond  l10  Áqwe 
}}}
You can use named groups instead of numbered groups (the default):
{{{
      pattern←'<h1>(?<STR>.*?)</h1>'     ⍝ group 'STR' is to contain the text in between
      (Regex.Match 'aaaaz<h1>Title</H1>sad ' pattern RegexOptions.IgnoreCase).Groups[⊂'STR']
Title
}}}
The <IsMatch> function returns 1 if the pattern is found.

Example:
Validate password conditions such as: "Password must be from 8 to 20 characters, must contain at least 2 letters and at least 2 digits. It can only contain letters and digits."
{{{
      p←⎕new Regex,⊂⊂⎕AV[236],'(?=.*?\d.*?\d)(?=(.*?[a-zA-Z]){2,})[\da-zA-Z]{8,20}$'
      p.IsMatch∘⊂¨'ds' ' 32a ' '0123456789x' '01234abcde56789wxyzp1'  '0123aAzZ'
0 0 0 0 1
}}}
You can split a string into substrings using the <Split> function.
This is roughly the complement of <Matches>: it returns not the matches but everything between them.

Example: split where X is followed by a digit:
{{{
      ⌷m←Regex.Split '1stX1aaaX2bbbX3cccX4ddd' 'X\d'
 1st  aaa  bbb  ccc  ddd 
}}}
Using regular expressions to replace strings.

This is easily done. You specify the pattern, text and how to replace using $n to denote group 'n'.

Example 1
Change "Surname, Name" into "Name Surname" (and account for spaces):
{{{
      pat←⎕new Regex (⊂'\s*(\w+)\s*,\s*(\w+)\s*') ⍝ group 1 is surname, group 2 is name
      pat.Replace '  Iverson,   Ken ' '$2 $1'
Ken Iverson
      ⍝ or, for a one time event:
      Regex.Replace'  Iverson,   Ken ' '\s*(\w+)\s*,\s*(\w+)\s*' '$2 $1'
}}}
You can also use named groups instead of numbers.
-Line 7:
+Line 177:
-r←IsValidEmailAddr emailAdr;regPattern
⍝⍝ Returns 1 if the string contains a
⍝⍝ valid email address, otherwise 0
 ⎕ML ⎕IO←1
 ⎕USING,←⊂''
 regPattern←'^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$'
 r←System.Text.RegularExpressions.Regex.IsMatch(emailAdr regPattern)
+      Regex.Replace'  Iverson,   Ken ' '\s*(?<Last>\w+)\s*,\s*(?<First>\w+)\s*' '${First} ${Last}'
Ken Iverson
}}}
Example 2
If you need special treatment, you can supply your own function to perform the replacement through
a MatchEvaluator.
You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs.
For example, to make example 1 capitalize only the first letter of each name, you can write the following
(ToUpper and ToLower are assumed to be case-conversion functions defined elsewhere in the workspace):
{{{
    ∇ str←cap arg
      str←(ToUpper arg.Groups[3].Value),ToLower arg.Groups[4].Value
      str,←' ',(ToUpper arg.Groups[1].Value),ToLower,arg.Groups[2].Value
    ∇
      capor←⎕NEW MatchEvaluator (⎕or'cap')
      pat←⎕NEW Regex,⊂⊂'\s*(\w)(\w*)\s*,\s*(\w)(\w*)\s*' ⍝ groups 1 & 3 are 1st name letters
      pat.Replace '  iVErson,   kEn ' capor
Ken Iverson
-Line 16:
+Line 196:
-{{{TestEmailAddr;msg
 msg←{⍵:'Is fine: ',⍺ ⋄ 'Is NOT fine: ',⍺}
 {⍵ msg IsValidEmailAddr ⍵}'valid@isfine.com'
 {⍵ msg IsValidEmailAddr ⍵}'valid@notsofine'
 {⍵ msg IsValidEmailAddr ⍵}'not$$valid@isfine.com'
}}}

returns

Is fine: ...

Is NOT fine: ...

Is NOT fine: ...

{{{
r←ProtocolAndPortFrom url;regPattern;q
⍝⍝ The following code example uses Match.Result to extract a protocol and port number from a URL
⍝⍝ For example, "http://www.contoso.com:8080/letters/readme.html" would return "http:8080".

 ⎕ML ⎕IO←1
 ⎕USING,←⊂''
 regPattern←'^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/'
 r←(System.Text.RegularExpressions.Regex.Match(url regPattern)).Result⊂'${proto}${port}'
}}}

{{{
TestProtocolAndPort
 ⎕←ProtocolAndPortFrom'http://www.contoso.com:8080/letters/readme.html'
}}}

returns

`http:8080`

Author: Kai Jaeger
+Author: Dan Baronet