Red by Example - an accessible reference by example

Last update on 20-Dec-2019

parse

index     parse     vid     series     draw     help     about     links     contact

1. PARSE IN RED
1.1. About The Parse Dialect
1.2. Parsing - An Introduction
1.3. Getting Started
1.4. Sequence, selection, repetition: Some, Any, |
1.5. Characters, Numbers, Names
1.6. Calculator Example With Strings
1.7. Inserting Red Code:  (  ) ,  Copy, Set
1.8. Calculator With Block Type: Copy And Set
1.9. Literal Types In String Input
2.1. Keywords inside/outside Parentheses
2.2. Skipping Input: Skip, To, Thru
2.3. Using Variable:
2.4. Using :Variable
2.5. Using a Variable
2.5. Remove, Insert, Change
2.6. Collect And Keep
2.7. Parse into/ahead (nested blocks)
3.1. Debugging - parse-trace

1. PARSE IN RED

1.1. About The Parse Dialect


The 'parse'  facility of Red  helps us to  build  mini-languages (DSLs 
- domain-specific languages). It lets us specify syntax, and also provides 
a simpler alternative to regular expressions when processing strings.  
Parse is built-in to Red, so we can use it as part of a larger Red  program.

Here is an example showing how simple and lightweight it can be: we have a 
2-character string, which should contain a product code of the form "A", 
"B" or "C" followed by either "1", "2", or "3", such as "A2" or "C3".  The 
following code does the validation, and Parse returns false if the string 
is incorrect:

product: "C3"
parse/case product [["A" | "B" | "C"] ["1" | "2" | "3" ]] ;-- '|' means 'or'

The input to Parse is a series (the data we are processing in some way) and 
a set of grammar rules.

The rules are similar to BNF specifications, but can also contain Red code 
and commands to copy and skip through the input.

Though Parse has things in common with compiler-compilers, it has no 
built-in facilities for e.g. symbol-table handling.  It is  simpler to use, 
however.  In fact, major parts of Red are themselves created with Parse, 
such as Draw and Vid.

In these notes, I will look at string and block input series, though 
any series! types are allowed, except image! and vector!.

If your input format has nothing to do with Red (e.g. HTML files, exported 
spreadsheets, strings from  a data-entry form, etc.) then you will use 
string input to Parse.  For some tasks, there might be lots of low-level 
rules, such as stating that an integer consists of several digits, or that 
a series of  spaces separates items.

On the other hand, the really interesting stuff is to build a DSL to be
used within Red.  If your input is blocks of Red, then Parse works at a 
higher level.  It knows that spaces separate items, and that 03:10:15 is a 
time, for example.  In fact most of the literal types are recognised.  This 
is normally the approach used to build a DSL. 

Red is heavily based on REBOL of course, and Red's Parse has extra features 
over REBOL 2's version.  REBOL users please note that the string-split 
feature in REBOL 2 has been moved from Parse into a separate split 
function, and that parse in Red is the same as parse/all in REBOL.


For more information on Parse:

Introducing Parse, by Nenad Rakocevic.

http://www.red-lang.org/2013/11/041-introducing-parse.html

The Parse chapter, from the REBOL 2  documentation:

http://www.rebol.com/docs/core23/rebolcore-15.html

top

1.2. Parsing - An Introduction


The syntax rules for programming languages are often specified in some kind 
of BNF.  
Basically, we need to create  rules which express:

** sequence: one item is followed by another item.

**  choice: an item can be a selection from several things.

** repetition:  an item can be made from  a repetition of items.  Sometimes 
 it is helpful to be able to express 'one or more of' and 'zero or more of'.

** sub-rules.  It is convenient to express sub-rules, breaking up complex 
 syntax into manageable chunks.  Sometimes it is useful to use recursive 
 rules for nested input.

Here is how these concepts occur in a fragment of a BASIC-style language:

    if a<b then c=42
    print "values ", a, b+2, c

Informally, we can say:

** A program consists of any number of statements.

** An 'if-statement' is a sequence of "if", a condition, "then", and a 
 statement.

** A 'print-statement' is a sequence.  It starts with the word "print", and 
 is followed by any number of items, with a "," between them.  Each item is 
 a selection from a quoted string and a numeric expression.

We would probably write sub-rules (sometimes called 'classes' in 
syntax-analysis) for a statement, a print-statement, an if-statement, a 
numeric-expression etc.  This simplifies things for humans, and allows 
re-use for commonly-occurring items.

Now we will look at some Parse examples.

  
top

1.3. Getting Started


Here is a tiny Red program which uses Parse:

    Red [ "Parsing"]
    parse-rules: ["move-" "north"]   ;-- sequence
    input: "move-north"
    print parse input parse-rules    ;-- prints true

    input: "move-south"
    print parse input parse-rules    ;-- prints false

It begins with Red[ ], like all Red programs.  We will omit this in the 
following code fragments.

Parse is a function which returns true if its input matches the rules, 
otherwise false.    

If we wanted to allow any direction, we could write a sub-rule.  We choose 
a name for it (using Red's rules for naming), and use  [   |   |   |   ....  
] for a choice:

    parse-rules: ["move-" direction]
    direction: ["north" | "south" | "east" | "west"]  ;-- choice
    input: "move-south"
    print parse input parse-rules   ;-- true

Note that string-matching is case-insensitive.  For case-sensitive 
matching, use the /case refinement.

We could have written the above as:

    parse-rules: ["move-" ["north" | "south" | "east" | "west"]]
    input: "move-south"
    print parse input parse-rules 

but a sub-rule provides  manageable chunks for humans.

With string input, spaces and newlines have no special significance.  For 
example, if we want a space to be a separator, we must say so in the 
rules.  With block input, things are different - this is covered later.

top

1.4. Sequence, selection, repetition: Some, Any, |


Whether the input is a string or a block, the action of these words is 
identical.  To specify a sequence, we write:

    [item1  item2  item3  etc...]

'Item' can be primitive thing, or can be a sub-rule.

To specify a selection, we write:

    [item1 | item2 | item3 | etc...]

To specify repetition, we write:

    [some[ item1 item2 item3 etc]]

'some' requires at least one occurrence. 

Note that we can write such rules as:

    [some "A" "B"]

but here, 'some' only applies to the first item, matching "AB"  "AAB" etc, 
but not "ABAB". I will opt for always using [ ] for 'some', even if they 
only enclose one item.

We can also use 'any', which specifies zero or more repetitions, as in:

    [any["A" "B"]]   ;-- e.g. "ABAB"  "AB"  ""

'Some' and 'any' will terminate when they encounter  an item that does not 
match, so in the case of some["A" "B"]

** "ABABABC" input  -  'some' will match "AB" then "AB" etc. successfully, 
 then terminate when it reaches "C"

** "C" input  -  'some' will not match because there are no "AB"s .  If we 
 used 'any', the match would succeed, terminating on "C".

There are other convenient forms of rules, as in:

[3 "A"]       - a count, matches "AAA" only.
[1 3 "A"]     - a range, matches "A" or "AA"  or "AAA".
[0 3 "A"]     - a range, but zero occurrences are matched, as in "".
["A" | none]  - a selection, matching "A" or "".
              - This lets us detect a missing "A" specifically.


top

1.5. Characters, Numbers, Names


This is the relatively low-level (lexical analysis) area.  We can specify, 
for example, than an integer is a series of digits (at least one).  We can 
make use of charset!:

    digit: charset "0123456789" ;-- any of these. (could also use  '-'
    integer: [some digit]       ;-- one or more

Charset! is often used for speed reasons.  We could also have used the 
slower:

    digit: [ "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" 
]
Here is how we can specify a typical variable name, which starts with a 
letter, then has letters or digits following:

letter: charset [#"A" - #"Z" #"a" - #"z"]  ;-- A to Z  and a to z
alpha-numeric: [letter | digit]
var-name: [letter [any [alpha-numeric]]]

(We can use the set functions 'union',  'intersect' etc for higher speed).

top

1.6. Calculator Example With Strings


Here is a toy example.  A calculation is made up of +, -, a memory (mem), 
and  integers.  Some calculations:

    +33-22
    -44
    +55+22-56+mem+1

There are 2 types of statement: memory and display, as in

    mem=+33+55
    display mem+100

There are no extra spaces, and a statement is followed by a newline.  Of 
course, we can write such calculations in straight Red - this is merely an 
example of parsing.  Here is one possible set of parse rules, followed by 
some code in our tiny language:

program: [some [statement] ]        ;-- at least one statement
statement: [[mem-instruction | display-instruction] newline]
mem-instruction: ["mem=" calculation]
display-instruction: ["display" space calculation]
calculation: [[some[pair]]] 
pair: [operator primitive]          ;-- all primitives preceded by op
operator: [ "+" | "-"] 
primitive: [ num |"mem" ] ;-- e.g. 123, mem
digit: charset "0123456789"
num: [1 6 digit ]                   ;-- 1 to 6 digits in a num

code: {mem=+3-4+1
mem=+mem+1
display -100-mem+101
}

either parse code program [
    print "Checked OK"
] [
    print "Error"
]

Some points about the rules:

** program: we used 'some'.  This disallows programs with no statements.  
 If we wanted, we can specifically check this case using 'none', perhaps to 
 display a targetted error message.  The rule would then be

    program: [some [statement] | none ] 

** statement: there are two types, either must be followed by a newline.  
 Note that space and newline are pre-defined values in Red.

** calculation: most of the work is done here.  The smallest legal 
 calculation is e.g. +3.   A calculations consists of 'some' pairs, in which 
 a pair is an operator (only '+' and '-' in this version) followed by a 
 primitive item.

** primitive:  either a 6-digit number, or the text "mem".  We are using 
 string input here, and we allow a maximum of 6 digits.  But what about 
 floating-point?  We could try to write the syntax for this, but a better 
 approach would be to use block input, and let Red recognise the types it 
 already knows about.  This is shown later. 

** I could have used less rules, but personally I find the extra names 
 improves clarity.

top

1.7. Inserting Red Code: ( ) , Copy, Set


We can insert Red code in our rules by enclosing it in parentheses.  For 
example, if we modify our calculator 'num' rule to:

    num: [1 6 digit (print "got num")]   

we will see "got num" displayed 4 times with our input of

in: {mem=+3+1234
display -4+3-mem
}

Note that if the rule is entered, but no match happens, code at the end of 
the rule is not executed.  At the console, for example:
 
<< rule: [(print "A") "start" (print "B") ]  ;-- match "start"
<< parse "start" rule       
A
B
== true                 ;-- matches OK

<< parse "st-art" rule
A
== false                ;-- no match, B not displayed

Similarly, in a selection, if we have

some-rule: ["begin" | "end"]

and we add some code like this:

some-rule: ["begin" | "end" (print "In some-rule") ]

it will only be executed when "end" is matched.  To see the message in 
either case, we should also put it immediately after "begin".

Within (...) we can put any Red code, such as using variables and calling 
functions.

The 'copy' word can be used to access the matched text.  It is followed by 
a variable name, and must directly precede a rule.  (i.e. don't put any 
bracketed code between 'copy' and a rule.  Here is an example:

    num: [copy number 1 6 digit (print[ "number: " number]) ]
    digit: charset "0123456789"
    parse "123" num

The rule is '1 6 digit', and it is preceded by a copy.  We are free to 
choose the variable name for copy.  When the rule succeeds (as it does 
above)  the print is executed, showing  123.

Here is another example:

    operator: [copy  op "+" | copy op "-"]

Note the copy in each selection.  Our 'op' variable can be referred to in 
another rule.  Later we will use 'set', which is similar to 'copy'.

There is also  a 'copy' in Red, but the Parse copy is different.

Here is the same grammar, with 'copy' and parenthesised code used to 
perform the execution:

total: 0
memory: 0
use-num: func [] [
    either op = "-" [;-- update the total   + or -
        total: total - to integer! prim
    ] [
        total: total + to integer! prim
    ]
]

program: [some [statement]]
statement: [[mem-instruction | display-instruction] newline]
mem-instruction: ["mem=" calculation (memory: total)] ;-- update memory
display-instruction: 
    ["display" space calculation (print ["display:  " total])]
calculation: [(total: 0) [some [pair]]] ;-- initialise the calculation
pair: [operator primitive]
operator: [copy op "+" | copy op "-"] ;-- remember the op
primitive: 
    [copy prim num (use-num) | copy prim "mem" (prim: memory use-num)]
digit: charset "0123456789"
num: [1 6 digit]

code: {mem=+3-4+1
mem=+mem+1
display -100-mem+101
}
print "Code:"
print code

either parse code program [
    print "Checked OK"
] [
    print "Error"
]

and here is the output:

    Code:
    mem=+3-4+1
    mem=+mem+1
    display -100-mem+101
    display:   0
    Checked OK

A function was used to do the addition or subtraction, but similar code 
could have been embedded in the rules.

Finally, we will put some calculator code in a file, and read it in. We 
create a file named (for example) calc-code.txt, and put our code there, as 
in:


mem=+3-4+1
mem=+mem+1
display -100-mem+101

Now we modify our program to read the file:

    code: read %calc-code.txt    ;-- in Red, % identifies a file name
    parse code program

top

1.8. Calculator With Block Type: Copy And Set


The above calculator worked, but the use of 6-digit integers was 
unrealstic.   Also, what if we wanted to change our calculator to work with 
float! or time! types?  Their syntax is not simple.  In such 
situations, we would use block input.   Here is some input for the block 
version:

code: [mem = + 4 - 4 + 1
mem = + mem + 1
display - 100 - mem + 101
]

We have to put spaces between items now.  In fact, the input code is now 
quite close to Red, and  there are various ways of interpreting it.  
However, we will continue with Parse, and compare  it to the string version 
above.  Here is the Parse code:

memory: 0
total: 0
use-num: func [] [
    either op = '- [ 
        total: total - prim
    ] [
        total: total + prim
    ]  
]
     
program: [some [statement]]
statement: [[mem-instruction | display-instruction]]
mem-instruction: ['mem '= calculation (memory: total)]
display-instruction: ['display calculation (print ["display: " total])]
calculation: [(total: 0) [some [pair]]]
pair: [operator primitive]
operator: [set op '+ | set op '- ]
primitive: 
    [set prim integer! (use-num) | set prim 'mem (prim: memory use-num)]

The differences from the string version are:

** we have removed references to space and newline.  We use the Red 
 approach of spaces separating items.

** the 'primitive' rule now uses the Red integer! type.  All literal 
 types (e.g. pair!, time!, float! etc can be similarly matched.  
 We can use any-type! to match any item in the input.

** to match a word - such as mem, we use 'mem.  Word matching is 
 case-insensitive.

** we used 'set' rather than 'copy'.  In the above, the 'copy' variable is 
 made into  a series type, even if it holds a single item, whereas the type 
 of the 'set' variable is what we want here (i.e integer! in the 
 'primitive' match).

Strictly, the value of 'copy' is the collection of matched items, whereas 
'set' uses the first matched item.  Here are two examples

    parse "Stuff---" ["Stuff" copy d some "-" (print [type? d d])]
    parse "Stuff---"  ["Stuff" set d some "-" (print [type? d d])]

In the first example, 'copy' contains the whole match "---" as a string, 
and in the second, 'set' contains the first matched item as the character 
#"-".

Here are some copy/set examples using blocks:

parse [Stuff 12 34] ['Stuff copy matched some integer!
   (print [type? matched matched])]
parse [Stuff 12 34] ['Stuff set matched some integer!
   (print [type? matched matched])]

The output is:

block 12 34 
integer 12

We can also read data from a file into a block with load, and use it 
directly in Parse, as in:

    parse   load %block-in.txt program

Where block-in.txt holds:

mem = + 4 - 4 + 1
mem = + mem + 1
display - 100 - mem + 101 - 2222

top

1.9. Literal Types In String Input


Parse rules used with strings cannot contain type names (pair!, 
integer!, float! etc). However, some literal values (not type 
names) can be used.  These are url!, email!, and tag!.  Tag 
values are the most common, as they are useful in HTML processing.  Here 
are some examples.  They are all 'true':

    print parse "Stuff>atag<" ["Stuff" >atag<]
    print parse "Stuffhttp://me.com" ["Stuff" http://me.com]
    print parse "Stuffme@super.org" ["Stuff" me@super.org]

Note the reduction of quotes in the rules, though we could also use:

    print parse "Stuff>atag<" ["Stuff" ">atag<"]
 
top

2.1. Keywords inside/outside Parentheses


We have seen that rules can contain Red code in parentheses, and also 
commands - such as 'copy' outside parentheses.  There are a number of other 
commands (keywords) that belong to Parse.  Sometimes they have the same 
name as Red words, but they belong to Parse, and in general they have a 
different meaning.  You have seen 'copy', 'set', 'some', 'any', etc and now 
we will look at others.  Many of them allow us to manipulate the input 
series.

A full list is available in:
Introducing Parse, by Nenad Rakocevic.
href="http://www.red-lang.org/2013/11/041-introducing-parse.html">http://www.
red-lang.org/2013/11/041-introducing-parse.html

top

2.2. Skipping Input: Skip, To, Thru


These words let us move through the input.  

'Skip'  - this skips one item in the input series.  For example:

    rule: ["a" skip "b"]
    print parse "axb" rule        ;-- true

Above, the rule states:
    - match an "a"
    - skip over any character
    - match a "b"

Using the above rule with different data:

    print parse "axc" rule       ;-- false - no b
    print parse "axxb" rule      ;-- false - no a after first x

Here is a block input example, which matches an integer, skips the next 
item (an integer here), then matches a string.  Parse returns 'true':

    print parse [123 456 "Hello"] [integer! skip string! ]

Now some more 'skip' examples, all true, using a range:

    print parse "axxxb" ["a" 1 3 skip "b"]   ;-- a, do 3 skips, b
    print parse "xxxa" [3 skip  "a"]         ;-- 3, skips, a

The 'thru' facility skips up to  a specified item, as in:

    parse "axxxbc" ["a" thru "b" "c"] ;--true: a, skip to b inclusive, c

In the above, 'b' is also skipped, and matching continues after it.

Note that this also works for multi-character strings in rules, as in:

    parse "axxxxxbeec" ["a" thru "bee" "c"]  ;-- skips to bee, matches c 

and from the start, here using a sub-rule:

    animal: ["black " "cat"]
    parse "whatever---black cat" [thru animal]        ;-- true

Alongside 'thru' there is the similar 'to', which does not include the item 
which ends the skip.  In the following example, the first 'b' is detected 
but is not part of the skip.  The second 'b' in the rule will match it:

 parse "axxxxxb" ["a" to "b" "b" ]    ;--  up to b, not including it.

Here is an example which copies the title of a web page:

html-code: ">html< >title<My Great Page>/title< Contents 
here... >/html<"
parse html-code [thru >title< copy the-title  to >/title<]
print ["Title is: " the-title]    ;-- prints:   My Great Page

We used the tag type in the rule, but could have used e.g. 
">title<".  Note the use  of  'to'.  If we used 'thru', then the 
result is:  

My Great Page>/title< 

 .top

2.3. Using Variable:


We can  create a variable, and set its value.  This is a Parse feature.  
Here are some examples.  First we look at getting a position in the series:

parse "ABCDEFG" [thru "D" place: (print place) ]     ;-- EFG printed

We created the 'place:' variable.  Here, we skip to "D", including it.  
Then, 'place' becomes a reference to the current input position.  Here, 
this is from "E" to the end.
	
Here is an example which finds the positions of items between START and 
STOP.  It uses the index? series function to get a numeric position 
from the reference to the series.

    parse [A B START C D E STOP F G] [
        thru 'START pos1: to 'STOP pos2: 
           (print [index? pos1 "-" (index? pos2) - 1 ])  ;-- prints 4 - 6
    ]    

If we display the two positions as references without using index?, we 
would get:

    C D E STOP F G - STOP F G

Here is another example of finding positions, this time in HTML text.  We 
are interested in >h1< items.

page: {>html< >title< My Great Page>/title<
>h1< Big Heading A<>/h1<>p<Stuff in  A >/p<
>h1< Big Heading B<>/h1<>p<Stuff in  B >/p<
>/html<
    }

;-- positions of text in  an h1
parse page [ any [thru >h1< h1-at: (print ["h1 at: " index? h1-at])]]
;-- position of the > in >h1<
parse page [ any [to ">h1" h1-at: thru "<" (print ["h1 at: " index? 
h1-at])]]

The first example uses an unquoted tag in the rule, and 'thru' includes the 
whole tag.

The second example uses a quoted string to find ">h1", notes the 
position, then moves to the closing "<".  

Note that if we try to do the second example with 'to', as in:

    any[ to >h1< ...]

Then parse gets stuck, repeatedly finding the same >h1<.

top

2.4. Using :Variable


Here we look at modifying the input series.  This example finds the start 
and end of the page's title text, and uses these references to modify the 
input series:

    parse page [thru <title> begin: to </title> ending:
        (change/part begin "A Better Title" ending) 
    ]
 

top

2.5. Using a Variable


We have looked at the use of ':word' and  'word:'.  If we use a word 
without a colon, its value is looked up and used, as in:

    a-word: 3
    print parse "xxxa" [a-word skip  "a"]      ;-- true

We could use this approach for counts, match-values etc.

top

2.5. Remove, Insert, Change


These Parse keywords modify the input series.  They can be more convenient 
than using Red code in parentheses.

We can remove the matched input, as in:

    inp: "XXX-------"
    parse inp [any "X" remove any "-"]           ;
    print inp          ;-- prints XXX

We can insert an item (e.g. a string, a block) at current input position.  
Scanning continues past the insertion.

    inp: "XXX-------"
    parse inp [any "X" insert "ABC" any "-"]           ;
    print inp           ;-- prints XXXABCabc-------

We can change the match input to  a given value:

    inp: [12 "a string" "b string" 22 44]  ;-- unordered strings,  integers
    parse inp [some[string! |  change integer! "NUMBER" ]]
    print inp
    print ""

The output is:

   NUMBER a string b string NUMBER NUMBER

top

2.6. Collect And Keep


Parse has its own collect and keep functions, similar to those in Red. 
Here we parse a block of various types of item, and keep only the 
integers.  Keep can be used several times.  Note that we match an integer 
before the more general any-type!.  There is also  a 'pick' option for 
'keep', which allows control over the storing of the selection.

    in: [12 3x44 "a string" 13  a-word 14]
    ;-- prints 12 13 14
    print parse in [collect[some[ keep integer! | any-type!]]] 
The 'collect' function returns a block via Parse, i.e. the normal Parse 
logic! result is not returned.


top

2.7. Parse into/ahead (nested blocks)


Programming languages usually have some way of delimiting nested code, such
as curly-brackets in C++, Java.  It is possible that your DSL might need nested
items.  We can make use of Red's square brackets for this.  Here is an example of
a nested structure. We can make the parser go into nested blocks with 'into'.

First we look 'ahead' to see if the next item is a block.

    ;-- a series of:    integers or nested integers
    list: [22 33 [44 [55] 66] 77]

    nested-rule: [ any[ item ]]
    item: [copy num integer! (print num)
           | ahead block! (print "got a block")
             into nested-rule (print "out of block")]
    parse list nested-rule

The output is:

    33
    got a block
    44
    got a block
    55
    out of block
    66
    out of block
    77 

top

3.1. Debugging - parse-trace


To look into how your rules are working, you can insert printing to provide 
a trace, as in:

    primitive: [ num | "mem" ]               ;-- original rule
    ;-- with prints
    primitive: [num (print "got num") | "mem" (print "got mem")]

You can also use the 'parse-trace' wrapper for Parse, as in:

    parse-trace some-input  a-rule

This displays a detailed trace of the parse process.