Regular expressions can be used in APL through .Net
Regular Expressions
Introduction
.Net regular expressions are based on those of Perl and are compatible with Perl 5 regular expressions. .Net provides a set of powerful classes that make regular expressions even easier to use.
The following is a list of the classes in the System.Text.RegularExpressions namespace:
- Capture
- Represents the results from a single subexpression capture. Capture represents one substring for a single successful capture.
- CaptureCollection
- Represents a sequence of capture substrings. CaptureCollection returns the set of captures done by a single capturing group.
- Group
- Group represents the results from a single capturing group. A capturing group can capture zero, one, or more strings in a single match because of quantifiers, so Group supplies a collection of Capture objects.
- GroupCollection
- Represents a collection of captured groups. GroupCollection returns the set of captured groups in a single match.
- Match
- Represents the results from a single regular expression match.
- MatchCollection
- Represents the set of successful matches found by iteratively applying a regular expression pattern to the input string.
- Regex
- Represents an immutable (read only) regular expression.
Examples
Here are a few examples of how to use them. The reader is assumed to be familiar with regexes, their syntax, groupings, etc.
      ⎕USING←'System.Text.RegularExpressions,system.dll' ⍝ This is where the Regex class resides
      ⎕wx ⎕io←3 1
      ⍝ There are 2 matching functions: <Match> and <Matches>
      ⍝ Let's start with the <Matches> function:
      ⍝ This function deals with all the matches, regardless of grouping:
      m←Regex.Matches 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Count ⍝ only 2 matches, not 4
2
      m[0 1].Index ⍝ they start at offset 2 and 6 (4 overlaps)
2 6
      (⌷m).Index ⍝ more succinctly
2 6
      ⍝ Another example
      text←'"tit for tat" said that fat and tall top cat'
      p1←'[ct].{0,3}[pt]' ⍝ find 'c' or 't' followed by 0 to 3 characters then by 'p' or 't'
      m←Regex.Matches text p1
      m.Count ⍝ 5 found
5
      ⌷m ⍝ these are objects, not strings
tit tat that top cat
      DISPLAY ⍕¨⌷m
┌───┬───┬────┬───┬───┐
│tit│tat│that│top│cat│
└───┴───┴────┴───┴───┘
      (⌷m).Index
1 9 19 37 41
      (⌷m).Length
3 3 4 3 3
      ⍝ Let's see the <Match> function:
      m←Regex.Match 'xxababababa' 'aba' ⍝ this function is non-overlapping
      m.Success
1
      m.Index
2
      m←m.NextMatch
      m.Success
1
      m.Index
6
      m←m.NextMatch
      m.Success
0
      ⍝ Let's capture groups with the <Match> function.
      ⍝ We're looking for names that have 4 sections separated by _ like a_b_c_d
      text←' a b_c+de_fg_hij÷kl_bnm_iop_good-qq_21_z9_not_this*5'
      ⍝                      [  this word  ]
      pattern←'\b([a-z0-9]+_){3}([a-z0-9]+)\b'
      ⍝          [ group 1  ]   [ group 2 ] - group 0 is the entire match
      +m←Regex.Match text pattern
kl_bnm_iop_good
      m.(Index Length)
17 15
      m.Groups.Count ⍝ groups 0 (all), 1 & 2.
3
      m.Groups[2]
good
Warning!
Some characters are treated in a special way in Dyalog. In particular, the caret, used in regexes, appears twice in ⎕AV and care must be taken to use the right one. The caret can mean two things in a regular expression:
- at the beginning of a pattern it anchors the match to the start of the input (or of each line when the Multiline option is set)
- as the first character inside a [set] it negates the set
The caret used in regexes for these purposes is found at ⎕AV[235] (with ⎕IO 0; ⎕AV[236] with ⎕IO 1). The one used for the APL function AND is found at ⎕AV[167]. Thus, to look for a line starting with ABC followed by a character other than D or E you would use the pattern '^ABC[^DE]', which would be constructed as
⎕AV[235],'ABC[',⎕AV[235],'DE]'
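The construction above depends on the index origin. A minimal sketch that does not, assuming the ⎕AV position quoted above is right for your interpreter (the name caret is only illustrative):

```apl
      ⎕USING←'System.Text.RegularExpressions,system.dll'
      caret←⎕AV[235+⎕IO]               ⍝ regex caret: position 235 when ⎕IO is 0, as stated in the text
      pattern←caret,'ABC[',caret,'DE]' ⍝ i.e. '^ABC[^DE]'
      Regex.IsMatch 'ABCX' pattern     ⍝ should succeed: X is neither D nor E
      Regex.IsMatch 'ABCD' pattern     ⍝ should fail: D is excluded by the set
```

Keeping the caret in one variable means the ⎕AV lookup only has to be right in one place.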
Options
.Net allows searches to be conducted in different ways through options. The main ones are CultureInvariant, IgnoreCase, IgnorePatternWhitespace, Multiline and Singleline.
To use an option, pass a member of RegexOptions after the pattern, as in RegexOptions.Multiline. To use several options simply add them up: RegexOptions.(Multiline+IgnoreCase)
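As a rough illustration of combining options, the sketch below searches a two-line text for 'abc' at the end of a line, ignoring case. The names text and opts and the sample data are made up for the example; the call assumes the Matches overload that takes options as its third argument.

```apl
      ⎕USING←'System.Text.RegularExpressions,system.dll'
      text←'abc',(⎕UCS 10),'ABC'               ⍝ two lines separated by a line feed
      opts←RegexOptions.(Multiline+IgnoreCase) ⍝ combine options by adding them
      m←Regex.Matches text 'abc$' opts         ⍝ under Multiline, $ matches at the end of each line
      m.Count                                  ⍝ two matches expected: 'abc' and 'ABC'
```

Without IgnoreCase only the first line should match, and without Multiline the $ would only match at the very end of the text.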
Other examples
Looking for an IP address (4 numbers of up to 3 digits each, separated by dots):
      text←'Dan, 192.168.1.2, foo-foo'
      z←Regex.Matches text '(\d{1,3}\.){3}\d{1,3}'
      {⍵/⍨⍵∊⎕d,'.'}{⍵/⍨3≥+\'.'=⍵}z[0].Index↓text
192.168.1.2
Looking for H1 text:
      text←'<h1>APL is powerful!</h1>'
      pattern←'<h1>(.*?)</h1>' ⍝ group 1 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[1]
Looking for text between any tag:
      text←'<h1>caption</h1><p>paragraph</p><ol><li>first</li><li>second</li></ol>'
      pattern←'<(\w+)>(.*?)</\1>' ⍝ group 2 contains the text in between
      (Regex.Match text pattern RegexOptions.IgnoreCase).Groups[2]
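Regex.Match only returns the first match. To collect the text between every tag pair, one option is to use Regex.Matches with named groups. This is only a sketch: the group names tag and body are invented for the example, and \k<tag> is the .Net syntax for a back-reference to a named group.

```apl
      ⎕USING←'System.Text.RegularExpressions,system.dll'
      text←'<h1>caption</h1><p>paragraph</p><ol><li>first</li><li>second</li></ol>'
      pattern←'<(?<tag>\w+)>(?<body>.*?)</\k<tag>>' ⍝ \k<tag> refers back to the opening tag name
      m←Regex.Matches text pattern RegexOptions.IgnoreCase
      {⍵.Groups[⊂'body'].Value}¨⌷m                  ⍝ text between the tags, one item per match
```

Because matches do not overlap, the two list items inside the ol element would come back as part of a single item rather than separately.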
Looking for an APL identifier (including system names):
      ok←0≤⎕NC 256 1⍴⎕AV ⍝ find all name forming characters: ∆, ⍙, Á, etc.
      r←'a-zA-Z',(ok/⎕AV)~,⎕AV[(⎕AV⍳'Aa')∘.+⍳26]
      ⍝ The pattern is any of those characters, followed by 0 or more of the same characters plus digits
      ⍝ and not preceded by : (for those :statements). No accounting for quotes or comments here.
      pattern←'(⎕?(?<!\s:)\b[',r,'][',r,'0-9]*|⍺⍺|⍺|⍵⍵|⍵)' ⍝ include alpha and omega
      ⌷m←Regex.Matches (⎕VR'test') pattern ⍝ find all names in the function <test>
test ⎕IO comment label cond l10 Áqwe
You can use named groups instead of numbered groups (the default):
      pattern←'<h1>(?<STR>.*?)</h1>' ⍝ group 'STR' is to contain the text in between
      (Regex.Match 'aaaaz<h1>Title</H1>sad ' pattern RegexOptions.IgnoreCase).Groups[⊂'STR']
Title
The <IsMatch> function returns 1 if the pattern is found.
Example: Validate password conditions such as: "Password must be from 8 to 20 characters, must contain at least 2 letters and at least 2 digits. It can only contain letters and digits."
      p←⎕new Regex,⊂⊂⎕AV[236],'(?=.*?\d.*?\d)(?=(.*?[a-zA-Z]){2,})[\da-zA-Z]{8,20}$'
      p.IsMatch∘⊂¨'ds' ' 32a ' '0123456789x' '01234abcde56789wxyzp1' '0123aAzZ'
0 0 0 0 1
You can split a string into substrings using the <Split> function. It is roughly the complement of <Matches>: it returns not the matches but everything in between them.
Example: split where X is followed by a digit:
      ⌷m←Regex.Split '1stX1aaaX2bbbX3cccX4ddd' 'X\d'
1st aaa bbb ccc ddd
Using regular expressions to replace strings
You specify the text, the pattern and the replacement string, in which $n denotes group 'n'.
Example 1
Change "Surname, Name" into "Name Surname" (and account for spaces):
      pat←⎕new Regex (⊂'\s*(\w+)\s*,\s*(\w+)\s*') ⍝ group 1 is surname, group 2 is name
      pat.Replace ' Iverson, Ken ' '$2 $1'
Ken Iverson
      ⍝ or, for a one time event:
      Regex.Replace' Iverson, Ken ' '\s*(\w+)\s*,\s*(\w+)\s*' '$2 $1'
You can also use named groups instead of numbers.
      Regex.Replace' Iverson, Ken ' '\s*(?<Last>\w+)\s*,\s*(?<First>\w+)\s*' '${First} ${Last}'
Ken Iverson
Example 2
If you need special treatment you can supply your own function to perform the replacement, using a MatchEvaluator. You can think of a MatchEvaluator as an event handler that fires when an "OnMatch event" occurs. For example, to make example 1 also ensure that only the first letter of each name is capitalized you can write:
      ⍝ ToUpper and ToLower are assumed to be case-conversion utilities defined elsewhere
    ∇ str←cap arg
      str←(ToUpper arg.Groups[3].Value),ToLower arg.Groups[4].Value
      str,←' ',(ToUpper arg.Groups[1].Value),ToLower,arg.Groups[2].Value
    ∇
      capor←⎕NEW MatchEvaluator (⎕or'cap')
      pat←⎕NEW Regex,⊂⊂'\s*(\w)(\w*)\s*,\s*(\w)(\w*)\s*' ⍝ groups 1 & 3 are the first letters of the names
      pat.Replace ' iVErson, kEn ' capor
Ken Iverson
Author: Dan Baronet