Regular expressions can be used in APL through .Net.
.Net regular expressions are based on those of Perl and are compatible with Perl 5 regular expressions. .Net contains a set of powerful classes that make it even easier to use regular expressions.
The following is a list of classes in the System.Text.RegularExpressions namespace:
Capture: Represents the results from a single subexpression capture. Capture represents one substring for a single successful capture.
CaptureCollection: Represents a sequence of capture substrings. CaptureCollection returns the set of captures done by a single capturing group.
Group: Group represents the results from a single capturing group. A capturing group can capture zero, one, or more strings in a single match because of quantifiers, so Group supplies a collection of Capture objects.
GroupCollection: Represents a collection of captured groups. GroupCollection returns the set of captured groups in a single match.
Match: Represents the results from a single regular expression match.
MatchCollection: Represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.
Regex: Represents an immutable (read only) regular expression.
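To make the relationship between Match, Group and Capture concrete, here is a minimal sketch. It assumes the .Net interface is available and uses an illustrative input string that is not part of the original examples. A group under a quantifier captures once per repetition, so its Captures collection holds every repetition while the Group itself reports only the last one.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
m←Regex.Match 'kl_bnm_iop_good' '([a-z0-9]+_){3}([a-z0-9]+)'
m.Groups.Count                  ⍝ 3: group 0 (whole match), groups 1 and 2
m.Groups[1].Value               ⍝ iop_ (a Group reports its last capture)
m.Groups[1].Captures.Count      ⍝ 3: one Capture per repetition
(⌷m.Groups[1].Captures).Value   ⍝ the three captures: kl_ bnm_ iop_
```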
Here are a few examples of how to use them. The reader is assumed to be familiar with regexes, their syntax, groupings, etc.
      ⎕USING←'System.Text.RegularExpressions,system.dll' ⍝ This is where the Regex class resides
      ⎕wx ⎕io←3 1
⍝ There are 2 matching functions: <Match> and <Matches>
⍝ Let's start with the <Matches> function:
⍝ This function deals with all the matches, regardless of grouping:
      m←Regex.Matches 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Count ⍝ only 2 matches, not 4
2
      m[0 1].Index ⍝ they start at offset 2 and 6 (a match at 4 would overlap)
2 6
      (⌷m).Index ⍝ more succinctly
2 6
⍝ Another example
      text←'"tit for tat" said that fat and tall top cat'
      p1←'[ct].{0,3}[pt]' ⍝ find 'c' or 't' followed by 0 to 3 characters then by 'p' or 't'
      m←Regex.Matches text p1
      m.Count ⍝ 5 found
5
      ⌷m ⍝ these are objects, not strings
 tit  tat  that  top  cat
      DISPLAY ⍕¨⌷m
┌───┬───┬────┬───┬───┐
│tit│tat│that│top│cat│
└───┴───┴────┴───┴───┘
      (⌷m).Index
1 9 19 37 41
      (⌷m).Length
3 3 4 3 3
⍝ Let's see the <Match> function:
      m←Regex.Match 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Success
1
      m.Index
2
      m←m.NextMatch
      m.Success
1
      m.Index
6
      m←m.NextMatch
      m.Success
0
⍝ Let's capture groups with the <Match> function.
⍝ We're looking for names that have 4 sections separated by _ like a_b_c_d
      text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
⍝                            [  this word  ]
      pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
⍝                [ group 1  ]   [ group 2 ]   - group 0 is the entire match
      +m←Regex.Match text pattern
kl_bnm_iop_good
      m.(Index Length)
17 15
      m.Groups.Count ⍝ groups 0 (all), 1 & 2.
3
      m.Groups[2]
good
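The Match/NextMatch sequence above can be wrapped in a loop. The following is a sketch only: MatchOffsets is a hypothetical helper name, not part of .Net or Dyalog. It also illustrates one common way around the non-overlapping behaviour: wrapping the pattern in a zero-width lookahead, so that each match consumes no characters.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'

∇ r←text MatchOffsets pattern;m
  ⍝ Hypothetical helper: collect the offset of every match
  ⍝ by walking the Match/NextMatch chain until Success is 0
  r←⍬
  m←Regex.Match text pattern
  :While m.Success
      r,←m.Index
      m←m.NextMatch
  :EndWhile
∇

'xxababababa' MatchOffsets 'aba'       ⍝ 2 6      (non-overlapping)
'xxababababa' MatchOffsets '(?=aba)'   ⍝ 2 4 6 8  (lookahead finds overlaps too)
```

With the lookahead form each match has length 0, so the matched text itself has to be extracted from the original string using the offsets.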
WARNING: some characters are treated in a special way in Dyalog. In particular, the caret, used in regexes, appears twice in ⎕AV and care must be taken to use the right one. For example, the caret can mean two things in a regular expression:
1. at the beginning of a pattern it means 'pattern starts at the beginning'
2. as the first character inside [sets] it means negate
The caret used in regexes for that purpose is found at ⎕AV[235] (⎕IO 0). The one used for the APL function AND is found at ⎕AV[167]. Thus, to look for a line starting with ABC and not followed by D or E, you would use the pattern '^ABC[^DE]', which would be constructed (with ⎕IO←0) as
      ⎕AV[235],'ABC[',⎕AV[235],'DE]'
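An alternative that does not depend on ⎕AV positions or on ⎕IO is to build the caret from its Unicode code point. This is a sketch under the assumption that ⎕UCS is available (it is in Unicode editions; behaviour in a Classic edition depends on the ⎕AVU translation and has not been checked here). The name caret is purely illustrative.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
caret←⎕UCS 94                      ⍝ U+005E, the caret that regexes expect
pattern←caret,'ABC[',caret,'DE]'   ⍝ same pattern as above
Regex.IsMatch 'ABCF' pattern       ⍝ 1: ABC at the start, followed by neither D nor E
Regex.IsMatch 'ABCD' pattern       ⍝ 0: ABC is followed by D
```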
Options
.Net allows some searches to be conducted in a different manner. The main options are CultureInvariant, IgnoreCase, IgnorePatternWhitespace, Multiline and Singleline.
To use an option, refer to it through the RegexOptions class, as in RegexOptions.Multiline. To combine several options, simply add them up: RegexOptions.(Multiline+IgnoreCase)
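As a small illustration of combined options, the sketch below matches a whole line regardless of case. The text and pattern are made-up examples, and the caret is built with ⎕UCS 94 on the assumption that a Unicode edition is in use. Multiline lets the caret and the dollar sign match at line boundaries, and IgnoreCase removes case sensitivity.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
text←'first line',(⎕UCS 10),'Second LINE'    ⍝ two lines separated by a linefeed
pattern←(⎕UCS 94),'second line$'
opts←RegexOptions.(Multiline+IgnoreCase)

Regex.IsMatch text pattern                        ⍝ 0: caret only matches at the very start
Regex.IsMatch text pattern RegexOptions.Multiline ⍝ 0: line found, but the case differs
Regex.IsMatch text pattern opts                   ⍝ 1: both options together
```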
Other examples.
Looking for an IP address (4 numbers of up to 3 digits, separated by dots):
Regex.Matches text '(\d{1,3}\.){3}\d{1,3}' ⍝ A simple IP address finder
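Applied to a sample string (invented here for illustration), the finder behaves as sketched below. Note that the pattern checks only the shape of the address, so it also accepts numbers above 255.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
text←'hosts 192.168.0.1 and 10.0.0.254 answered; 999.1.1.1 did not'
m←Regex.Matches text '(\d{1,3}\.){3}\d{1,3}'
m.Count     ⍝ 3: the out-of-range 999.1.1.1 is matched as well
⍕¨⌷m        ⍝ the matched addresses as character vectors
```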
Looking for H1 text:
      pattern←'<h1>(.*?)</h1>' ⍝ group 1 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[1]
Looking for text between ANY tag:
      pattern←'<(\w+)>(.*?)</\1>' ⍝ group 2 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[2]
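The backreference \1 requires the closing tag to repeat whatever group 1 captured. The sketch below uses an invented input in which the first tag is closed by the wrong name, so the first successful match is the second tag pair. With IgnoreCase the backreference is also compared without regard to case.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
pattern←'<(\w+)>(.*?)</\1>'
m←Regex.Match '<b>bold</i> <i>italic</I>' pattern RegexOptions.IgnoreCase
m.Groups[1].Value   ⍝ i       (the tag name; <b> has no matching </b>)
m.Groups[2].Value   ⍝ italic  (the text in between)
```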
Looking for an APL identifier (including system names):
      ok←0≤⎕NC 256 1⍴⎕AV ⍝ find all name forming characters: ∆, ⍙, Á, etc.
      r←'a-zA-Z',(ok/⎕AV)~,⎕AV[(⎕AV⍳'Aa')∘.+⍳26]
⍝ The pattern is any of those characters, followed by 0 or more of the same characters plus digits
⍝ and not preceded by : (for those :statements). No accounting for quotes or comments here.
      pattern←'(⎕?(?<!\s:)\b[',r,'][',r,'0-9]*|⍺⍺|⍺|⍵⍵|⍵)' ⍝ include alpha and omega
      ⌷m←Regex.Matches (⎕VR'test') pattern ⍝ find all names in the function <test>
 test  ⎕IO  comment  label  cond  l10  Áqwe
You can use named groups instead of numbered groups (the default):
      pattern←'<h1>(?<STR>.*?)</h1>' ⍝ group 'STR' is to contain the text in between
      (Regex.Match 'aaaaz<h1>Title</H1>sad ' pattern RegexOptions.IgnoreCase).Groups[⊂'STR']
Title
The <IsMatch> function returns 1 if the pattern is found.
Example: Validate password conditions such as: "Password must be from 8 to 20 characters, must contain at least 2 letters and at least 2 digits. It can only contain letters and digits."
      p←⎕new Regex,⊂⊂⎕AV[236],'(?=.*?\d.*?\d)(?=(.*?[a-zA-Z]){2,})[\da-zA-Z]{8,20}$'
      p.IsMatch∘⊂¨'ds' ' 32a ' '0123456789x' '01234abcde56789wxyzp1' '0123aAzZ'
0 0 0 0 1
You can split a string into substrings using the <Split> function. It is roughly the complement of <Matches>: it returns not the matches but everything in between them.
Example: split where X is followed by a digit:
      ⌷m←Regex.Split '1stX1aaaX2bbbX3cccX4ddd' 'X\d'
 1st  aaa  bbb  ccc  ddd
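One detail of the .Net Split function worth knowing: if the pattern contains capturing parentheses, the captured text is included in the result alongside the pieces. A short sketch with an invented input:

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
Regex.Split '1stX1aaaX2bbb' 'X\d'     ⍝ 1st aaa bbb      (separators dropped)
Regex.Split '1stX1aaaX2bbb' 'X(\d)'   ⍝ 1st 1 aaa 2 bbb  (captured digits kept)
```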
Using regular expressions to replace strings.
This is easily done. You specify the pattern, the text and the replacement string, in which $n denotes group 'n'.
Example 1: Change "Surname, Name" into "Name Surname" (and account for spaces):
      pat←⎕new Regex (⊂'\s*(\w+)\s*,\s*(\w+)\s*') ⍝ group 1 is surname, group 2 is name
      pat.Replace ' Iverson, Ken ' '$2 $1'
Ken Iverson
⍝ or, for a one time event:
      Regex.Replace' Iverson, Ken ' '\s*(\w+)\s*,\s*(\w+)\s*' '$2 $1'
You can also use named groups instead of numbers.
      Regex.Replace' Iverson, Ken ' '\s*(?<Last>\w+)\s*,\s*(?<First>\w+)\s*' '${First} ${Last}'
Ken Iverson
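Two further substitution forms are often handy: $0 stands for the entire match, and $$ produces a literal dollar sign. The inputs below are invented for illustration.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'
Regex.Replace 'call 555 1234' '\d+' '<$0>'   ⍝ call <555> <1234>
Regex.Replace 'cost 5' '\d+' '$$$0'          ⍝ cost $5  ($$ is a literal $, then $0)
```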
Example 2: If the replacement needs special processing, you can supply your own function through a MatchEvaluator. You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs. For example, to make example 1 capitalize only the first letter of each name, you can write
    ∇ str←cap arg
      ⍝ ToUpper and ToLower are assumed to be case-conversion functions present in the workspace
      str←(ToUpper arg.Groups[3].Value),ToLower arg.Groups[4].Value
      str,←' ',(ToUpper arg.Groups[1].Value),ToLower arg.Groups[2].Value
    ∇
      capor←⎕NEW MatchEvaluator (⎕or'cap')
      pat←⎕NEW Regex,⊂⊂'\s*(\w)(\w*)\s*,\s*(\w)(\w*)\s*' ⍝ groups 1 & 3 are the first letters of each name
      pat.Replace ' iVErson, kEn ' capor
Ken Iverson
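A second, smaller sketch of the same mechanism: a callback that computes the replacement from the matched text. The function name double and the sample sentence are hypothetical. The callback follows the same convention as cap above: it receives the Match object and returns the replacement as a character vector.

```apl
⎕USING←'System.Text.RegularExpressions,system.dll'

∇ r←double m
  ⍝ Hypothetical callback: m is the Match object for one match;
  ⍝ the result is the text that replaces it
  r←⍕2×⍎m.Value
∇

ev←⎕NEW MatchEvaluator (⎕OR'double')
pat←⎕NEW Regex (⊂'\d+')
pat.Replace 'buy 3 apples and 12 pears' ev   ⍝ intended result: buy 6 apples and 24 pears
```

Because the callback is ordinary APL code, any computation can be applied to each match, which is not possible with a plain substitution string.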
Author: Dan Baronet