Differences between revisions 25 and 36 (spanning 11 versions)
Revision 25 as of 2009-02-02 10:06:50
Size: 17798
Editor: SimonMarsden
Comment: Minor changes
Revision 36 as of 2017-02-16 19:31:38
Size: 13202
Editor: KaiJaeger
Comment:
Line 4: Line 4:
'''This article is currently under construction '''
Line 10: Line 8:
All code in this article is supposed to work with APL2, APL+WIN, APLX and Dyalog APL. There are minor differences, these will be mentioned.

== Run APL and use the Session Manager ==
In principle, all code in this article is supposed to work with APL2, APL+WIN, Dyalog APL and NARS2000. There are minor differences, these will be mentioned.

== Run APL and use the session manager ==
Line 20: Line 18:
 {{{ {{{
Line 46: Line 44:
== APL: Powerful, Short, Concise == == APL: powerful, short, concise ==
Line 54: Line 52:
To make life a bit easier we are going to deal with well-formed code only. So let's take code from a website which is definitely supposed to provide well-formed code, although most HTML pages on the web are still syntactically incorrect:

http://www.w3.org/
To make life a bit easier we are going to deal with well-formed code only. So let's take code from a website which provides well-formed code, although most HTML pages on the web are still syntactically incorrect:

http://download.aplwiki.com/apltree/
Line 60: Line 58:
{{attachment:w3c.gif}}

Right-click on the page and select "View Source" from the context menu. In a separate window you will now see something similar to this:

{{attachment:source.gif}}

Put the focus onto the text, then press Ctrl+A to select the entire HTML code and finally press Ctrl+C to copy it into the clipboard. Now we return to APL's session window. We need to create a new variable which is supposed to hold the HTML code we have just copied. This is how we do it in the different flavours of APL:

<<SeeSaw(section="dyalog1", toshow="<<Show>> Dyalog", tohide="<<Hide>> Dyalog", bg="#FEE1A5", speed="Slow")>>

{{{{#!wiki seesaw/dyalog1/dyalog1-bg/hide

Make sure that you enter the following statement '''without''' pressing <Enter>:
{{{
)ed myHtml
}}}

While "myHtml" is the name of the variable we are going to create, ")ed" is a bit special: the closing parenthesis tells APL that this statement is a system command. The following two characters (ed) then tell APL to invoke the editor.

Now we can insert the HTML code from the clipboard by selecting the paste command from the toolbar:

{{attachment:paste_1.gif}}

Now press enter. We will see something like this:

{{attachment:ed_1.gif}}
}}}}

<<SeeSaw(section="aplx1", toshow="<<Show>> APLX", tohide="<<Hide>> APLX", bg="#FEE1A5", speed="Slow")>>

{{{{#!wiki seesaw/aplx1/aplx1-bg/hide

{{{
myHtml←'⎕' ⎕wi 'text'
}}}

In APLX, a system function can be used to access the contents of the clipboard. System functions as well as system variables start their names with a {{{⎕}}} character which is called Quad, so here we are using the Quad-WI (Windowing Interface) system function. As left argument we specify another Quad in quotes, standing for system. As right argument we specify the property name we are interested in.

We can display the variable in the editor with
{{{
)Edit myHtml
}}}

We will see something like this:

{{attachment:ed_1b.gif}}

However, the format is not what we would like to have: currently the variable is a string. The former records in the HTML-code are separated from each other by new-line characters. With the following statement we can reformat the array:

{{{
myHtml←(myHtml≠⎕av[14])⊂myHtml
}}}

As a result we get a vector of strings, and at the same time we got rid of all the empty lines in the original HTML code.
}}}}

<<SeeSaw(section="aplplus1", toshow="<<Show>> APL+WIN", tohide="<<Hide>> APL+WIN", bg="#FEE1A5", speed="Slow")>>

{{{{#!wiki seesaw/aplplus1/aplplus1-bg/hide

Once we have the HTML code on the clip board the easiest way to get it into a variable in the workspace is to open a new vector via the editor and paste in the contents of the clip board. Then simply save the vector with the name myHtml.

As with the APLX example we can reformat the vector into a vector of strings using the same code:

{{{
myHtml←(myHtml≠⎕av[14])⊂myHtml
}}}

Whilst strictly outside the scope of this article more experienced APLers might find it a lot easier to create an instance of the Microsoft XMLHTTP server control com object and download the html code as a text string directly into a variable bypassing all the browser and clip board operations. The APL+WIN code to do this is:

{{{
 ∇ r←GetUrl url
                                                                            
⍝Create an instance of the http server control
⎕wself←'HTTP' ⎕wi 'Create' 'MSXML2.ServerXMLHTTP'
                                                                            
⍝Open the control and create a get request from the url
⎕wi 'XOpen' 'GET' url 0
                                                                            
⍝Send the get request
⎕wi 'XSend'
                                                                            
⍝Check that the text is available
r←⎕wi 'xstatusText'
                                                                            
⍝If the text is available output it and delete the control
:if 2=+/r∊'OK'
    r←⎕wi 'xresponseText'
:endif
                                                                            
⍝If the text is not available close the control and output the error message
'No text available.'
⎕wi 'Delete'


}}}

In this particular case the line separators in the downloaded text are line feed characters so the code to download the HTML and reformat the text string as a vector of strings is:

{{{
myHtml←(myHtml≠⎕av[11])⊂myHtml←GetUrl 'http://www.w3.org/'
}}}

When using this approach from within an APL application to extract information from websites directy it is necessary to become familiar with the site HTML so that you can not only strip out the HTML tags but you can also filter out just the information you need using techniques you can develop from the basic code explained in the body of this article.

}}}}

=== Examine the HTML Code ===

After selecting "Exit" from the "File" menu APL will establish the variable in what is called a "workspace". Let's examine the length of the variable we have just created. For this purpose there is a function called "shape", represented by the {{{⍴}}} symbol, which is in fact the Greek "rho" character. Generally, in APL a function may take one argument or two arguments or no argument at all. In our case it is exactly one argument, the variable. The variable then must be specified to the right of the function:
 {{{
{{attachment:sampleHtml.png}}


The contents of that page may be read into APL2, APL+Win, Dyalog and NARS200 as follows:

 1. Goto http://download.aplwiki.com/apltree/LatestStableVersions/
 1. Save the contents of that page to a file (say C:\myhtml.htm) (This is done slightly differently for each browser)
 1. Within the APL interpreter:

{{{
    tn←'C:\myhtml.htm' ⎕ntie 0
    conversionCode←82
    noOfBytes←⎕NSIZE ¯1
    startAt←0
    myHtml←⎕NREAD tn, conversionCode, noOfBytes, startAt
    myHtml←(myHtml≠(⎕UCS 13))⊂myHtml
}}}

Note that this code works in Dyalog. It might need some minor changes in other interpreters.

Note that `⎕UCS 13` produces a new line character in Dyalog. If you use a different interpreter check your documentation where in `⎕AV` you can find this character.


=== Examine the HTML code ===

Let's examine the length of the variable we have just created. For this purpose there is a function called "shape", represented by the {{{⍴}}} symbol, which is in fact the Greek "rho" character. Generally, in APL a function may take one argument or two arguments or no argument at all. In our case it is exactly one argument, the variable. The variable then must be specified to the right of the function:
{{{
Line 172: Line 86:
648
 }}}
199
}}}
Line 177: Line 91:
The 648 represents the number of lines (or records) in the file the source code was saved in.

To find out the length of each of the strings in the 648 items of myHtml, we need to introduce APL operators, a concept that is radically different from anything called "operator" elsewhere. In APL, an operator takes at least one function as an operand. It then creates a so-called "derived function" by applying operator-specific rules to that function or functions.
The 199 represents the number of lines (or records) in the file the source code was saved in.

To find out the length of each of the strings in the 199 items of myHtml we need to introduce APL operators, a concept that is radically different from anything called "operator" elsewhere. In APL, an operator takes at least one function as an operand. It then creates a so-called "derived function" by applying operator-specific rules to that function or functions.
Line 183: Line 97:
To find out the length of every single string in myHtml we have to provide the function ⍴ to every single item in that variable:
 {{{
To find out the length of every single string in myHtml we have to provide the function `` to every single item in that variable:
{{{
Line 186: Line 100:
      38 109 73 122 96  0 359 668 0 40 81 46 70 67 51 55 0 59 57 57 59 61 141 7 0 6 114 0 61 0 5 66 20 0 661 222 0 111 6 6 6 0 128 70 62 68
...
 }}}

In fact this is a loop, executed exactly 648 times, but we do not need to know this, or to care about it. <Ruby value="ignore">Every programmer must get exited right now!</Ruby>
      184 103 23 37 80 135 2623 16  8 12 133 31 6  17 5 7 20 88 25 88 77 38 4  313 20 20 135 7 26 8 5...
}}}

In fact this is a loop, executed exactly 199 times, but we do not need to know this, or to care about it. <Ruby value="ignore">Every programmer must get exited right now!</Ruby>
Line 196: Line 109:
The operator "reduce", represented by the {{{/}}} character, is defined as "take its operand (the function) and put it between all the items of the array passed to the derived function. The expression:
 {{{
The operator "reduce", represented by the `/` character, is defined as "take its operand (the function) and put it between all the items of the array passed to the derived function". The expression:
{{{
Line 199: Line 112:
 }}} }}}
Line 202: Line 115:
 {{{ {{{
Line 205: Line 118:
 }}} }}}
Line 211: Line 124:
=== The Strategy === === The strategy ===
Line 219: Line 132:
 * We can ignore the fact that !JavaScript code would look like code following the two rules we just worked out, because the w3c homepage does not contain any !JavaScript.  * We can ignore the fact that !JavaScript code would look like code following the two rules we just worked out, because the page does not contain any !JavaScript.
Line 222: Line 135:
 {{{ {{{
Line 224: Line 137:
 }}}

=== Find Start Points and End Points ===
}}}

=== Find start points and end points ===
Line 232: Line 145:
 {{{ {{{
Line 235: Line 148:
 }}} }}}
Line 240: Line 153:
 {{{ {{{
Line 245: Line 158:
 }}} }}}
Line 255: Line 168:
 {{{ {{{
Line 262: Line 175:
 }}} }}}
Line 265: Line 178:
 {{{ {{{
Line 268: Line 181:
 }}} }}}
Line 273: Line 186:
 {{{ {{{
Line 276: Line 189:
 }}} }}}
Line 279: Line 192:
 {{{ {{{
Line 282: Line 195:
 }}} }}}
Line 285: Line 198:
 {{{ {{{
Line 288: Line 201:
 }}} }}}
Line 300: Line 213:
 {{{ {{{
Line 303: Line 216:
 }}} }}}
Line 306: Line 219:
 {{{ {{{
Line 309: Line 222:
 }}} }}}
Line 312: Line 225:
 {{{ {{{
Line 319: Line 232:
 }}} }}}
Line 322: Line 235:
 {{{ {{{
Line 327: Line 240:
 }}} }}}
Line 331: Line 244:
 {{{ {{{
Line 336: Line 249:
 }}} }}}
Line 343: Line 256:
 {{{ {{{
Line 348: Line 261:
 }}}

Looks good. Now let's perform this function to all items from the w3c's website with the help of the "each" operator and assign the result to a variable "content". We then will display the result in an editor window:
 {{{
}}}

Looks good. Now let's perform this function to all items from the `myHtml` variable with the help of the "each" operator and assign the result to a variable "content". We then will print the result to the session:
{{{
Line 353: Line 266:

}}}

{{attachment:ed_2.png}}

The result obviously contains many empty lines. If you are surprised: any line in the source code which contains nothing but HTML tags is now empty. This is true for all the lines containing <meta> tags, for example.

Strictly speaking however the lines are not empty since every line still holds a line feed character at the moment. With what we've learned so far we can remove this Line Feed characters from each record:

{{{
      myHtml←myHtml~¨⎕AV[2+⎕IO]
}}}

Now we are ready to remove all empty lines in a last step without looking into the details. The following statement removes all blanks from every single item in {{{content}}} and then checks the length of the rest. Only items with a length greater 0 will survive:
{{{
      content←(0<∊⍴¨content~¨' ')/content
      ⎕←content
}}}

However, for this document we shall use this command:

{{{
Line 354: Line 289:
 }}}

{{attachment:ed_2.gif}}

The result obviously contains many empty lines. If you are surprised: any line in the source code which contains nothing but HTML tags is now empty. This is true for all the lines containing <meta> tags, for example.

So let's remove all empty lines in a last step without looking into the details. The following statement removes all blanks from every single item in {{{content}}} and then checks the length of the rest. Only items with a length greater 0 will survive:
 {{{
      content←(0<↑¨⍴¨content~¨' ')/content
 }}}
}}}

in order to display the variable in an editor window. This command is Dyalog specific. Most other APL dialects have similar commands to achieve the same.
Line 367: Line 295:
{{attachment:ed_3.gif}}

== Make it General ==
{{attachment:ed_3.png}}

== Make it general ==
Line 372: Line 300:
 {{{ {{{
Line 376: Line 304:
 }}} }}}

APL in 20 Minutes

Which flavour of APL?

In principle, all code in this article is supposed to work with APL2, APL+WIN, Dyalog APL and NARS2000. There are minor differences; these will be mentioned.

Run APL and use the session manager

As a starting point let us assume that an imaginary user has just started APL by selecting the appropriate command from the Windows "Start" menu.

What you get is APL's development environment, a so-called session manager. Since APL is an interpreted language, you can type something into the session window and then press <Enter>. APL will try to evaluate your expression and display the result, or it will tell you that something is wrong. The symbol ⍝, for obvious reasons called "lamp", indicates a comment: anything to the right of a lamp character is ignored by the interpreter.

Here you can see some simple examples. Input lines are indented by 6 characters, the interpreter's response starts on the left:

      1+2
3
      1+2×3
7
      2×3+1
8 
⍝ ???
⍝ Well, there is only *one* single precedence rule in APL: processing starts from the right
      (2×3)+1
7
⍝ But parentheses are processed first
⍝ Let us create a variable "int":
      int←1
      int+2
3
      int←3 4 8
      int+1
4 5 9
⍝ APL is an array language:
      int+int
6 8 16

Okay, we got a first impression.
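
The right-to-left rule is easiest to see with a function that is not commutative. The following lines are a small sketch using only subtraction, addition and multiplication; the numbers are made up for illustration:

```apl
      10-3-2        ⍝ evaluated as 10-(3-2), not (10-3)-2
9
      (10-3)-2      ⍝ parentheses force the left-hand subtraction first
5
      2×3+4×5       ⍝ 4×5 first, then 3+20, then 2×23
46
```

Each function simply takes everything to its right as its right argument, which is why no precedence table is needed.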

APL: powerful, short, concise

It is often claimed that with APL even complex problems can be solved in a few lines, if not in a single one. Is this really true?

I suggest that we solve a small (but not too small) real problem: let us take the source code of a web page and remove all the HTML tags from it. As a result we expect to see the real content of the page and nothing else. How fast can this be done?

The task

To make life a bit easier we are going to deal with well-formed code only. So let's take code from a website which provides well-formed code, although most HTML pages on the web are still syntactically incorrect:

http://download.aplwiki.com/apltree/

You should see something similar to this in your browser window:

sampleHtml.png

The contents of that page may be read into APL2, APL+WIN, Dyalog and NARS2000 as follows:

  1. Go to http://download.aplwiki.com/apltree/LatestStableVersions/

  2. Save the contents of that page to a file (say C:\myhtml.htm) (This is done slightly differently for each browser)
  3. Within the APL interpreter:

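    tn←'C:\myhtml.htm' ⎕NTIE 0          ⍝ tie the file; 0 lets APL pick a free tie number
    conversionCode←82                   ⍝ read the bytes as characters
    noOfBytes←⎕NSIZE tn                 ⍝ size of the tied file in bytes
    startAt←0                           ⍝ start reading at the first byte
    myHtml←⎕NREAD tn, conversionCode, noOfBytes, startAt
    ⎕NUNTIE tn                          ⍝ release the file again
    myHtml←(myHtml≠(⎕UCS 13))⊂myHtml    ⍝ split into lines at each carriage return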

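Note that this code was written for Dyalog. The last line relies on dyadic ⊂ acting as "partition", as it does in APL2; in Dyalog that requires ⎕ML←3 (from version 16.0 onwards, ⊆ does the same under the default ⎕ML). The code might need some minor changes in other interpreters.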

Note that ⎕UCS 13 produces a carriage return character in Dyalog, the first of the two characters that end each line in a Windows text file. If you use a different interpreter, check your documentation to find out where in ⎕AV this character is located.
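
To see what the splitting expression does without reading a file, here is a minimal sketch on a made-up string that uses a comma instead of a line separator. It assumes that dyadic ⊂ behaves as APL2-style "partition"; in Dyalog that means ⎕ML←3, or ⊆ in place of ⊂ in version 16.0 and later:

```apl
      s←'one,two,,three'
      s≠','                  ⍝ 1 for every character that is not a separator
1 1 1 0 1 1 1 0 0 1 1 1 1 1
      parts←(s≠',')⊂s        ⍝ every run of 1s becomes one item; the separators are dropped
      ⍴parts
3
      ∊⍴¨parts               ⍝ lengths of the three items
3 3 5
```

Note that the empty field between the two adjacent commas disappears, so empty lines in a file are removed in the same way.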

Examine the HTML code

Let's examine the length of the variable we have just created. For this purpose there is a function called "shape", represented by the ⍴ symbol, which is in fact the Greek "rho" character. Generally, in APL a function may take one argument or two arguments or no argument at all. In our case it is exactly one argument, the variable. The variable then must be specified to the right of the function:

      ⍴myHtml
199      

Your result may differ, because the page might have changed in the meantime.

The 199 represents the number of lines (or records) in the file the source code was saved in.

To find out the length of each of the strings in the 199 items of myHtml we need to introduce APL operators, a concept that is radically different from anything called "operator" elsewhere. In APL, an operator takes at least one function as an operand. It then creates a so-called "derived function" by applying operator-specific rules to that function or functions.

Sounds impressive and means nothing to you? Well, let's try to work out what that means in practice. APL comes with an operator "Each", represented by the ¨ symbol. This operator takes its operand and applies it to all elements of the array provided to the right.

To find out the length of every single string in myHtml we have to apply the function ⍴ to every single item in that variable:

      ⍴¨myHtml
      184  103  23  37  80  135  2623  16  8  12  133  31  6  17  5  7  20  88  25  88  77  38  4  313  20  20  135  7  26  8  5...

In fact this is a loop, executed exactly 199 times, but we do not need to know this, or to care about it. Every programmer must get excited right now!
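
The same idea on a vector small enough to check by hand. The variable name words and its contents are made up for illustration; monadic ∊ ("enlist") is used here only to flatten the nested result into a simple vector:

```apl
      words←'one' 'three' 'fifteen'
      ⍴words          ⍝ number of items in the outer vector
3
      ∊⍴¨words        ⍝ shape of each item, flattened
3 5 7
```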

Operators

Let us deviate from the main point for a moment and introduce another operator.

The operator "reduce", represented by the / character, is defined as "take its operand (the function) and put it between all the items of the array passed to the derived function". The expression:

      +/1 2 3 4 5 6 7

therefore means that according to the rules we have just defined, APL will build up this:

      1+2+3+4+5+6+7
28

You might have an idea how extraordinarily powerful this concept is, since you can specify any function here which fits, including self-defined ones.
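
A few more reductions, as a sketch of how the same operator behaves with other primitive functions (× is multiply, ⌈ is maximum). The last line shows that the usual right-to-left evaluation also applies to the expression that reduce builds:

```apl
      ×/1 2 3 4 5     ⍝ 1×2×3×4×5
120
      ⌈/3 9 2 7       ⍝ 3⌈9⌈2⌈7: the largest item
9
      -/1 2 3 4       ⍝ 1-2-3-4, that is 1-(2-(3-4))
¯2
```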

Extract content from code

The strategy

Now we want to get the content from the code. For this we need a strategy:

  • We know that the < and > characters delimit HTML tags, because these characters have to be encoded if they are contained in the content itself.

  • We therefore also know that everything between < and > is of no interest to us, including the < and > themselves.

  • We can ignore the fact that JavaScript code would look like code following the two rules we just worked out, because the page does not contain any JavaScript.

So the strategy is easy: build up a mask of Booleans which allows us to get rid of all the HTML stuff. Let's work out how we can do this in APL. In the first step, we create a small string we can play with:

      testVars←'<p style="font-size: medium">Hello, world</p><p>There we go</p>'

Find start points and end points

First, we need to know where the < and the > characters are. For this, we have to master the "membership" function, represented by the ∊ symbol, a form of the Greek letter epsilon.

It does exactly what the name suggests: for every element of the left argument it checks whether that element is contained in the right argument. If that is the case, a 1 is returned, otherwise a 0:

      testVars∊'<>'
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1

As you can see, Booleans are represented by 1 (true) and 0 (false) in APL.
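
A shorter example makes it easier to line up the result with the characters. The strings are made up for illustration; note that the result always has the shape of the left argument:

```apl
      'hello'∊'aeiou'     ⍝ one Boolean per character of 'hello'
0 1 0 0 1
      '<>'∊'a<b'          ⍝ arguments swapped: one Boolean per character of '<>'
1 0
```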

For the next step we need another operator which is very close to the "reduce" operator we already met. This article calls it "expand" (most APL documentation calls it "scan" when it is used with a function), and it uses the symbol \. Let us start with some expressions to get familiar with this new operator:

      +/1 2 3 4  ⍝ Reduce: APL makes 1+2+3+4 from this
10
      +\1 2 3 4  ⍝ Expand: APL makes (1) (1+2) (1+2+3) (1+2+3+4) from this 
1 3 6 10     

Expand returns a result for every single step. Let's follow the interpreter step by step:

  1. The first item (the 1) is taken and printed
  2. The first and the second item are added up and the result (3) is printed
  3. The result we just got (3) is taken and added to the third element, which results in 6
  4. The result we just got (6) is taken and added to the fourth element, which results in 10
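
The \ operator works with other functions as well. The sketch below uses made-up numbers; the last line is a caveat rather than a recommendation: each item of the result is really the reduction of the corresponding leading part of the argument, so the "carry the last result forward" description above holds only for associative functions such as +, ×, ⌈ or ≠:

```apl
      ×\1 2 3 4       ⍝ running product
1 2 6 24
      ⌈\3 1 4 1 5     ⍝ running maximum
3 3 4 4 5
      -\1 2 3 4       ⍝ (1) (1-2) (1-(2-3)) (1-(2-(3-4)))
1 ¯1 2 ¯2
```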

So far so good. Let's try the same thing with the ≠ function. This is a very simple function that takes a left and a right argument and checks them for being different. The result is a Boolean:

      0≠0
0
      1≠0
1
      1≠1
0

In APL, you can use Booleans in arithmetic operations:

      3+9=9
4

First 9=9 is processed (right-to-left!) which returns a 1 for true and then the 1 is added to 3.
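
Because Booleans are ordinary numbers, counting is just a sum over a comparison. A small sketch with a made-up string:

```apl
      +/'mississippi'='s'      ⍝ how many times does 's' occur?
4
      +/'mississippi'∊'sp'     ⍝ how many characters are an 's' or a 'p'?
6
```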

Let's use the membership function to find out where the < and > are located:

      testVars∊'<>'
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1

And now we put some magic in place:

      ≠\testVars∊'<>'
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0

That is a big step forward. Let's look into the details:

      ≠\1 0 0 0 1 0 1
1 1 1 1 0 0 1

According to the rule we just worked out, the "expand" operator \ performs the following steps:

  1. Take the first item (1) and print it
  2. Take the first and second item and pass them as left and right argument to the function, here ≠. That leads to 1 ≠ 0 which is true, so a 1 is returned.
  3. Take the last result (1) and take the third item: 1 ≠ 0 results in 1
  4. Take the last result (1) and take the fourth item: 1 ≠ 0 results in 1
  5. Take the last result (1) and take the fifth item: 1 ≠ 1 results in 0
  6. Take the last result (0) and take the sixth item: 0 ≠ 0 results in 0
  7. Take the last result (0) and take the seventh item: 0 ≠ 1 results in 1
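
Put differently, every 1 in the argument flips the running state between "inside" and "outside". The following sketch shows this on a string short enough to follow by eye; the string is made up for illustration:

```apl
      s←'a<b>c'
      s∊'<>'          ⍝ where are the delimiters?
0 1 0 1 0
      ≠\s∊'<>'        ⍝ the "<" switches the state on, the ">" switches it off again
0 1 1 0 0
      ≠\1 1 1 1       ⍝ four flips in a row
1 0 1 0
```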

Now we can use the vector of Booleans we got to mask the HTML code. For this we use again the / symbol, but this time we do not provide a function as left operand but the vector of Booleans. Used this way, / is usually called "compress" or "replicate". An easy example:

      1 0 0 1 1/'abcde'
ade

As you can see, items associated with a 1 are still represented in the result while those associated with a zero are not. We can use this to do:

      (≠\testVars∊'<>')/testVars
<p style="font-size: medium"</p<p</p

Oops. We were looking for the content, not the HTML code. We have to negate the Booleans. This can be done with the function ~ :

      ~1
0     
      ~0
1      
      (~≠\testVars∊'<>')/testVars   
>Hello, world>>There we go>

We are almost there. The only remaining problem is that the closing > of every tag has survived. In the next step the Boolean "or" function, represented by ∨, is used to solve this problem:

      ≠\1 0 0 0 1 0 1
1 1 1 1 0 0 1
      a∨≠\a←1 0 0 0 1 0 1
1 1 1 1 1 0 1
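
The complete chain on a short made-up string, one step per line, as a sketch of how the pieces fit together:

```apl
      s←'a<b>c'
      m←s∊'<>'        ⍝ positions of the delimiters
      ≠\m             ⍝ covers "<b" but misses the closing ">"
0 1 1 0 0
      m∨≠\m           ⍝ "or" with the delimiter positions adds the ">"
0 1 1 1 0
      (~m∨≠\m)/s      ⍝ negate and select: only the content is left
ac
```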

We use these techniques now to define a function:

    ∇ r←enclosedBy Strip string;mask
      mask←string∊enclosedBy      ⍝ 1 wherever a delimiter occurs
      mask←~mask∨≠\mask           ⍝ 1 for every character outside the delimiters
      r←mask/string               ⍝ keep only those characters
    ∇

Now we have a function "Strip". The vector of Booleans is assigned to the variable "mask" which is a local one. That means that this variable exists only inside the "Strip" function when the code is executed.
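
For comparison only, the same logic can be written as a one-line direct function ("dfn"). This is a sketch that assumes Dyalog's dfn syntax, which not every interpreter mentioned in this article supports; ⍺ stands for the left argument, ⍵ for the right argument, and the name m is local to the dfn:

```apl
      Strip←{(~m∨≠\m←⍵∊⍺)/⍵}     ⍝ ⍺: the two delimiters, ⍵: the string
      '<>' Strip '<b>bold</b> text'
bold text
```

The expression is read from right to left: m is assigned first and can therefore be used again further to the left in the same line.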

Problem solved

Let us test our newly created function:

      testVars ⍝ calling the name of a variable simply prints its value
<p style="font-size: medium">Hello, world</p><p>There we go</p>
      '<>' Strip testVars
Hello, worldThere we go

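Looks good. Now let's apply this function to all items of the myHtml variable with the help of the "each" operator and assign the result to a variable "content". The left argument is enclosed (⊂) so that "each" pairs the whole two-character string with every single item of myHtml. We then will print the result to the session: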

      content←(⊂'<>')Strip¨myHtml

ed_2.png

The result obviously contains many empty lines. If you are surprised: any line in the source code which contains nothing but HTML tags is now empty. This is true for all the lines containing <meta> tags, for example.

Strictly speaking, however, the lines are not empty, since every line still holds a line feed character at the moment. With what we've learned so far we can remove these line feed characters from each record:

      content←content~¨⎕AV[2+⎕IO]

Now we are ready to remove all empty lines in a last step without looking into the details. The following statement removes all blanks from every single item in content and then checks the length of the rest. Only items with a length greater than 0 will survive:

      content←(0<∊⍴¨content~¨' ')/content
      ⎕←content

However, for this document we shall use this command:

      )ed content

in order to display the variable in an editor window. This command is Dyalog specific. Most other APL dialects have similar commands to achieve the same.

Having executed that expression, the contents of the editor window have changed:

ed_3.png
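
The statement that removed the empty lines can be tried out on a small vector. This is a sketch with made-up data; the names lines and kept are hypothetical:

```apl
      lines←'abc' '   ' '' 'de'
      ⍴lines
4
      kept←(0<∊⍴¨lines~¨' ')/lines     ⍝ drop items that are empty once the blanks are removed
      ⍴kept
2
      ∊⍴¨kept                          ⍝ the surviving items are 'abc' and 'de'
3 2
```

The blanks are removed only for the test; the surviving items themselves are left unchanged.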

Make it general

A last remark: the function "Strip" is able to do more than simply remove HTML code. Because the delimiters are passed as the left argument rather than being hard-coded, we can use the function for other purposes as well:

      v←'This is some nonsense (written by me) you can forget'
      '()'Strip v
This is some nonsense  you can forget
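
One limitation is worth knowing: the mask is a simple on/off toggle, so nested delimiters are not handled. The sketch below applies the same expressions as the body of "Strip" directly to a made-up string, so it does not depend on the function being defined:

```apl
      s←'a(b(c)d)e'
      m←s∊'()'
      (~m∨≠\m)/s      ⍝ the inner "(" switches the mask off again, so the "c" survives
ace
```

For well-formed HTML this does not matter, because < and > cannot be nested inside a tag.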

Created by KaiJaeger


CategoryGuides

AplIn20Minutes (last edited 2017-02-16 19:31:38 by KaiJaeger)