Red by Example - an accessible reference by example

Last update on 20-Dec-2019

1. PARSE IN RED
1.1. About The Parse Dialect
1.2. Parsing - An Introduction
1.3. Getting Started
1.4. Sequence, selection, repetition: Some, Any, |
1.5. Characters, Numbers, Names
1.6. Calculator Example With Strings
1.7. Inserting Red Code: ( ) , Copy, Set
1.8. Calculator With Block Type: Copy And Set
1.9. Literal Types In String Input
2.1. Keywords inside/outside Parentheses
2.2. Skipping Input: Skip, To, Thru
2.3. Using Variable:
2.4. Using :Variable
2.5. Using a Variable
2.5. Remove, Insert, Change
2.6. Collect And Keep
2.7. Parse into/ahead (nested blocks)
3.1. Debugging - parse-trace

1. PARSE IN RED


1.1. About The Parse Dialect


The 'parse' facility of Red helps us to build mini-languages (DSLs
- domain-specific languages). It lets us specify syntax, and also provides
a simpler alternative to regular expressions when processing strings.
Parse is built-in to Red, so we can use it as part of a larger Red program.

Here is an example showing how simple and lightweight it can be: we have a
2-character string, which should contain a product code of the form "A",
"B" or "C" followed by either "1", "2", or "3", such as "A2" or "C3". The
following code does the validation, and Parse returns false if the string
is incorrect:

product: "C3"
parse/case product [["A" | "B" | "C"] ["1" | "2" | "3" ]] ;-- '|' means 'or'
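As a quick check of the rule above, the following sketch runs it over a few
product codes. The list of test strings is my own, chosen for illustration,
and 'rule' is just a name I have given to the block used above.

```red
rule: [["A" | "B" | "C"] ["1" | "2" | "3"]]

foreach product ["C3" "A2" "D1" "c3" "A22"] [
    print [product "->" parse/case product rule]
]
;-- C3 and A2 give true
;-- D1 (bad letter), c3 (wrong case, because of /case) and
;-- A22 (an extra character is left over) give false
```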

The input to Parse is a series (the data we are processing in some way) and
a set of grammar rules.

The rules are similar to BNF specifications, but can also contain Red code
and commands to copy and skip through the input.

Though Parse has things in common with compiler-compilers, it has no
built-in facilities for e.g. symbol-table handling. It is simpler to use,
however. In fact, major parts of Red are themselves created with Parse,
such as Draw and Vid.

In these notes, I will look at string and block input series, though
any series! types are allowed, except image! and vector!.

If your input format has nothing to do with Red (e.g. HTML files, exported
spreadsheets, strings from a data-entry form, etc.) then you will use
string input to Parse. For some tasks, there might be lots of low-level
rules, such as stating that an integer consists of several digits, or that
a series of spaces separates items.

On the other hand, the really interesting stuff is to build a DSL to be
used within Red. If your input is blocks of Red, then Parse works at a
higher level. It knows that spaces separate items, and that 03:10:15 is a
time, for example. In fact most of the literal types are recognised. This
is normally the approach used to build a DSL.

Red is heavily based on REBOL of course, and Red's Parse has extra features
over REBOL 2's version. REBOL users please note that the string-split
feature in REBOL 2 has been moved from Parse into a separate split
function, and that parse in Red is the same as parse/all in REBOL.


For more information on Parse:

Introducing Parse, by Nenad Rakocevic.

http://www.red-lang.org/2013/11/041-introducing-parse.html


The Parse chapter, from the REBOL 2 documentation:

http://www.rebol.com/docs/core23/rebolcore-15.html



1.2. Parsing - An Introduction


The syntax rules for programming languages are often specified in some kind
of BNF.
Basically, we need to create rules which express:

** sequence: one item is followed by another item.

** choice: an item can be a selection from several things.

** repetition: an item can be made from a repetition of items. Sometimes
it is helpful to be able to express 'one or more of' and 'zero or more of'.

** sub-rules. It is convenient to express sub-rules, breaking up complex
syntax into manageable chunks. Sometimes it is useful to use recursive
rules for nested input.

Here is how these concepts occur in a fragment of a BASIC-style language:

if a<b then c=42
print "values ", a, b+2, c

Informally, we can say:

** A program consists of any number of statements.

** An 'if-statement' is a sequence of "if", a condition, "then", and a
statement.

** A 'print-statement' is a sequence. It starts with the word "print", and
is followed by any number of items, with a "," between them. Each item is
a selection from a quoted string and a numeric expression.

We would probably write sub-rules (sometimes called 'classes' in
syntax-analysis) for a statement, a print-statement, an if-statement, a
numeric-expression etc. This simplifies things for humans, and allows
re-use for commonly-occurring items.

Now we will look at some Parse examples.



1.3. Getting Started


Here is a tiny Red program which uses Parse:

Red [ "Parsing"]
parse-rules: ["move-" "north"] ;-- sequence
input: "move-north"
print parse input parse-rules ;-- prints true

input: "move-south"
print parse input parse-rules ;-- prints false

It begins with Red[ ], like all Red programs. We will omit this in the
following code fragments.

Parse is a function which returns true if its input matches the rules,
otherwise false.

If we wanted to allow any direction, we could write a sub-rule. We choose
a name for it (using Red's rules for naming), and use [ ... | ... | ... ]
for a choice:

parse-rules: ["move-" direction]
direction: ["north" | "south" | "east" | "west"] ;-- choice
input: "move-south"
print parse input parse-rules ;-- true

Note that string-matching is case-insensitive. For case-sensitive
matching, use the /case refinement.
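A minimal sketch of the difference, reusing the "move-north" rule from
above with upper-case input:

```red
rule: ["move-" "north"]

print parse      "MOVE-NORTH" rule   ;-- true  - case is ignored
print parse/case "MOVE-NORTH" rule   ;-- false - case must match
print parse/case "move-north" rule   ;-- true
```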

We could have written the above as:

parse-rules: ["move-" ["north" | "south" | "east" | "west"]]
input: "move-south"
print parse input parse-rules

but a sub-rule provides manageable chunks for humans.

With string input, spaces and newlines have no special significance. For
example, if we want a space to be a separator, we must say so in the
rules. With block input, things are different - this is covered later.
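To illustrate the point about spaces in string input, here is a small
sketch. The input strings are invented for the example; 'space' is the
pre-defined Red character value.

```red
;-- the space in the input is not skipped automatically
print parse "move north"   ["move" "north"]             ;-- false

;-- state the separator explicitly
print parse "move north"   ["move" space "north"]       ;-- true

;-- allow one or more spaces between the words
print parse "move   north" ["move" some space "north"]  ;-- true
```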


1.4. Sequence, selection, repetition: Some, Any, |


Whether the input is a string or a block, the action of these words is
identical. To specify a sequence, we write:

[item1 item2 item3 etc...]

'Item' can be a primitive thing, or can be a sub-rule.

To specify a selection, we write:

[item1 | item2 | item3 | etc...]

To specify repetition, we write:

[some[ item1 item2 item3 etc]]

'some' requires at least one occurrence.

Note that we can write such rules as:

[some "A" "B"]

but here, 'some' only applies to the first item, matching "AB" "AAB" etc,
but not "ABAB". I will opt for always using [ ] for 'some', even if they
only enclose one item.

We can also use 'any', which specifies zero or more repetitions, as in:

[any["A" "B"]] ;-- e.g. "ABAB" "AB" ""

'Some' and 'any' will terminate when they encounter an item that does not
match, so in the case of some ["A" "B"]:

** "ABABABC" input - 'some' will match "AB" then "AB" etc. successfully,
then terminate when it reaches "C"

** "C" input - 'some' will not match because there are no "AB"s . If we
used 'any', the match would succeed, terminating on "C".

There are other convenient forms of rules, as in:

[3 "A"] - a count, matches "AAA" only.
[1 3 "A"] - a range, matches "A" or "AA" or "AAA".
[0 3 "A"] - a range, but zero occurrences are matched, as in "".
["A" | none] - a selection, matching "A" or "".
- This lets us detect a missing "A" specifically.
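The forms described in this section can be tried out with a few one-line
tests. This is only a sketch, and the inputs are my own. Remember that
Parse returns true only when the rules match and the whole input has been
consumed.

```red
print parse "AAB"  [some "A" "B"]        ;-- true
print parse "ABAB" [some "A" "B"]        ;-- false - 'some' covers "A" only
print parse "ABAB" [some ["A" "B"]]      ;-- true
print parse ""     [some ["A" "B"]]      ;-- false - needs at least one
print parse ""     [any  ["A" "B"]]      ;-- true  - zero is allowed

print parse "AAA"  [3 "A"]               ;-- true
print parse "AA"   [1 3 "A"]             ;-- true
print parse "AAAA" [1 3 "A"]             ;-- false - a fourth "A" is left over
print parse "B"    [["A" | none] "B"]    ;-- true  - the "A" is optional
```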



1.5. Characters, Numbers, Names


This is the relatively low-level (lexical analysis) area. We can specify,
for example, that an integer is a series of digits (at least one). We can
make use of the charset function, which creates a bitset! value:

digit: charset "0123456789" ;-- any of these (could also use '-' for a range)
integer: [some digit] ;-- one or more

A charset is often used for speed reasons. We could also have used the
slower:

digit: ["0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"]

Here is how we can specify a typical variable name, which starts with a
letter, then has letters or digits following:

letter: charset [#"A" - #"Z" #"a" - #"z"] ;-- A to Z and a to z
alpha-numeric: [letter | digit]
var-name: [letter [any [alpha-numeric]]]

(We can use the set functions 'union', 'intersect' etc for higher speed).
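As a sketch of that last remark, 'union' can merge the two charsets into a
single one, so that 'alpha-numeric' becomes one bitset! test rather than a
choice between two rules. The test strings are invented for the example.

```red
digit:  charset "0123456789"
letter: charset [#"A" - #"Z" #"a" - #"z"]

alpha-numeric: union letter digit       ;-- one combined charset
var-name: [letter any alpha-numeric]

print parse "total2" var-name   ;-- true
print parse "2total" var-name   ;-- false - must start with a letter
```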


1.6. Calculator Example With Strings


Here is a toy example. A calculation is made up of +, -, a memory (mem),
and integers. Some calculations:

+33-22
-44
+55+22-56+mem+1

There are 2 types of statement: memory and display, as in

mem=+33+55
display mem+100

There are no extra spaces, and a statement is followed by a newline. Of
course, we can write such calculations in straight Red - this is merely an
example of parsing. Here is one possible set of parse rules, followed by
some code in our tiny language:

program: [some [statement] ] ;-- at least one statement
statement: [[mem-instruction | display-instruction] newline]
mem-instruction: ["mem=" calculation]
display-instruction: ["display" space calculation]
calculation: [[some[pair]]]
pair: [operator primitive] ;-- all primitives preceded by op
operator: [ "+" | "-"]
primitive: [ num |"mem" ] ;-- e.g. 123, mem
digit: charset "0123456789"
num: [1 6 digit ] ;-- 1 to 6 digits in a num

code: {mem=+3-4+1
mem=+mem+1
display -100-mem+101
}

either parse code program [
print "Checked OK"
] [
print "Error"
]

Some points about the rules:

** program: we used 'some'. This disallows programs with no statements.
If we wanted, we can specifically check this case using 'none', perhaps to
display a targeted error message. The rule would then be

program: [some [statement] | none ]

** statement: there are two types, either must be followed by a newline.
Note that space and newline are pre-defined values in Red.

** calculation: most of the work is done here. The smallest legal
calculation is e.g. +3. A calculation consists of 'some' pairs, in which
a pair is an operator (only '+' and '-' in this version) followed by a
primitive item.

** primitive: either a number of 1 to 6 digits, or the text "mem". We are using
string input here, and we allow a maximum of 6 digits. But what about
floating-point? We could try to write the syntax for this, but a better
approach would be to use block input, and let Red recognise the types it
already knows about. This is shown later.

** I could have used fewer rules, but personally I find that the extra names
improve clarity.


1.7. Inserting Red Code: ( ) , Copy, Set


We can insert Red code in our rules by enclosing it in parentheses. For
example, if we modify our calculator 'num' rule to:

num: [1 6 digit (print "got num")]

we will see "got num" displayed 4 times with our input of

code: {mem=+3+1234
display -4+3-mem
}

Note that if the rule is entered, but no match happens, code at the end of
the rule is not executed. At the console, for example:

>> rule: [(print "A") "start" (print "B") ] ;-- match "start"
>> parse "start" rule
A
B
== true ;-- matches OK

>> parse "st-art" rule
A
== false ;-- no match, B not displayed

Similarly, in a selection, if we have

some-rule: ["begin" | "end"]

and we add some code like this:

some-rule: ["begin" | "end" (print "In some-rule") ]

it will only be executed when "end" is matched. To see the message in
either case, we should also put it immediately after "begin".
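Here is a sketch of two ways to get the message for either word. The
second form, which groups the choice in its own block and puts the code
after it, is my own variation on the rule above.

```red
;-- code repeated after each alternative
some-rule: ["begin" (print "In some-rule") | "end" (print "In some-rule")]
parse "begin" some-rule     ;-- message printed
parse "end"   some-rule     ;-- message printed

;-- or: group the choice, then run the code once
some-rule: [["begin" | "end"] (print "In some-rule")]
parse "begin" some-rule     ;-- message printed
parse "end"   some-rule     ;-- message printed
```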

Within (...) we can put any Red code, such as using variables and calling
functions.

The 'copy' word can be used to access the matched text. It is followed by
a variable name, and must directly precede a rule (i.e. don't put any
parenthesised code between 'copy' and a rule). Here is an example:

num: [copy number 1 6 digit (print[ "number: " number]) ]
digit: charset "0123456789"
parse "123" num

The rule is '1 6 digit', and it is preceded by a copy. We are free to
choose the variable name for copy. When the rule succeeds (as it does
above) the print is executed, showing 123.

Here is another example:

operator: [copy op "+" | copy op "-"]

Note the copy in each selection. Our 'op' variable can be referred to in
another rule. Later we will use 'set', which is similar to 'copy'.

There is also a 'copy' in Red, but the Parse copy is different.

Here is the same grammar, with 'copy' and parenthesised code used to
perform the execution:

total: 0
memory: 0
use-num: func [] [
either op = "-" [;-- update the total + or -
total: total - to integer! prim
] [
total: total + to integer! prim
]
]

program: [some [statement]]
statement: [[mem-instruction | display-instruction] newline]
mem-instruction: ["mem=" calculation (memory: total)] ;-- update memory
display-instruction:
["display" space calculation (print ["display: " total])]
calculation: [(total: 0) [some [pair]]] ;-- initialise the calculation
pair: [operator primitive]
operator: [copy op "+" | copy op "-"] ;-- remember the op
primitive:
[copy prim num (use-num) | copy prim "mem" (prim: memory use-num)]
digit: charset "0123456789"
num: [1 6 digit]

code: {mem=+3-4+1
mem=+mem+1
display -100-mem+101
}
print "Code:"
print code

either parse code program [
print "Checked OK"
] [
print "Error"
]

and here is the output:

Code:
mem=+3-4+1
mem=+mem+1
display -100-mem+101
display: 0
Checked OK

A function was used to do the addition or subtraction, but similar code
could have been embedded in the rules.

Finally, we will put some calculator code in a file, and read it in. We
create a file named (for example) calc-code.txt, and put our code there, as
in:


mem=+3-4+1
mem=+mem+1
display -100-mem+101

Now we modify our program to read the file:

code: read %calc-code.txt ;-- in Red, % identifies a file name
parse code program


1.8. Calculator With Block Type: Copy And Set


The above calculator worked, but the use of 6-digit integers was
unrealistic. Also, what if we wanted to change our calculator to work with
float! or time! types? Their syntax is not simple. In such
situations, we would use block input. Here is some input for the block
version:

code: [mem = + 4 - 4 + 1
mem = + mem + 1
display - 100 - mem + 101
]

We have to put spaces between items now. In fact, the input code is now
quite close to Red, and there are various ways of interpreting it.
However, we will continue with Parse, and compare it to the string version
above. Here is the Parse code:

memory: 0
total: 0
use-num: func [] [
either op = '- [
total: total - prim
] [
total: total + prim
]
]

program: [some [statement]]
statement: [[mem-instruction | display-instruction]]
mem-instruction: ['mem '= calculation (memory: total)]
display-instruction: ['display calculation (print ["display: " total])]
calculation: [(total: 0) [some [pair]]]
pair: [operator primitive]
operator: [set op '+ | set op '- ]
primitive:
[set prim integer! (use-num) | set prim 'mem (prim: memory use-num)]

The differences from the string version are:

** we have removed references to space and newline. We use the Red
approach of spaces separating items.

** the 'primitive' rule now uses the Red integer! type. All literal
types (e.g. pair!, time!, float! etc.) can be similarly matched.
We can use any-type! to match any item in the input.

** to match a word - such as mem, we use 'mem. Word matching is
case-insensitive.

** we used 'set' rather than 'copy'. In the above, the 'copy' variable is
made into a series type, even if it holds a single item, whereas the type
of the 'set' variable is what we want here (i.e integer! in the
'primitive' match).

Strictly, the value of 'copy' is the collection of matched items, whereas
'set' uses the first matched item. Here are two examples

parse "Stuff---" ["Stuff" copy d some "-" (print [type? d d])]
parse "Stuff---" ["Stuff" set d some "-" (print [type? d d])]

In the first example, 'copy' contains the whole match "---" as a string,
and in the second, 'set' contains the first matched item as the character
#"-".

Here are some copy/set examples using blocks:

parse [Stuff 12 34] ['Stuff copy matched some integer!
(print [type? matched matched])]
parse [Stuff 12 34] ['Stuff set matched some integer!
(print [type? matched matched])]

The output is:

block 12 34
integer 12

We can also read data from a file into a block with load, and use it
directly in Parse, as in:

parse load %block-in.txt program

Where block-in.txt holds:

mem = + 4 - 4 + 1
mem = + mem + 1
display - 100 - mem + 101 - 2222
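Returning to the point that other literal types can be matched by type
name in block input, here is a small sketch. The input values are invented
for the example. I am assuming that typesets such as number! can be used
in the same way as any-type!, mentioned above.

```red
;-- each item is matched by its type
print parse [3.14 10:30:00 2x3] [float! time! pair!]   ;-- true

;-- a typeset covers several types: integers and floats here
print parse [1 2.5 3] [some number!]                   ;-- true

;-- a string is not a number
print parse [1 "two" 3] [some number!]                 ;-- false
```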


1.9. Literal Types In String Input


Parse rules used with strings cannot contain type names (pair!,
integer!, float! etc). However, some literal values (not type
names) can be used. These are url!, email!, and tag!. Tag
values are the most common, as they are useful in HTML processing. Here
are some examples. They are all 'true':

print parse "Stuff<atag>" ["Stuff" <atag>]
print parse "Stuffhttp://me.com" ["Stuff" http://me.com]
print parse "Stuffme@super.org" ["Stuff" me@super.org]

Note the reduction of quotes in the rules, though we could also use:

print parse "Stuff<atag>" ["Stuff" "<atag>"]


2.1. Keywords inside/outside Parentheses


We have seen that rules can contain Red code in parentheses, and also
commands - such as 'copy' outside parentheses. There are a number of other
commands (keywords) that belong to Parse. Sometimes they have the same
name as Red words, but they belong to Parse, and in general they have a
different meaning. You have seen 'copy', 'set', 'some', 'any', etc and now
we will look at others. Many of them allow us to manipulate the input
series.

A full list is available in:
Introducing Parse, by Nenad Rakocevic.
http://www.red-lang.org/2013/11/041-introducing-parse.html



2.2. Skipping Input: Skip, To, Thru


These words let us move through the input.

'Skip' - this skips one item in the input series. For example:

rule: ["a" skip "b"]
print parse "axb" rule ;-- true

Above, the rule states:
- match an "a"
- skip over any character
- match a "b"

Using the above rule with different data:

print parse "axc" rule ;-- false - no b
print parse "axxb" rule ;-- false - no b after first x

Here is a block input example, which matches an integer, skips the next
item (an integer here), then matches a string. Parse returns 'true':

print parse [123 456 "Hello"] [integer! skip string! ]

Now some more 'skip' examples, all true, using a range and a count:

print parse "axxxb" ["a" 1 3 skip "b"] ;-- a, 1 to 3 skips, b
print parse "xxxa" [3 skip "a"] ;-- 3 skips, a

The 'thru' facility skips up to and including a specified item, as in:

parse "axxxbc" ["a" thru "b" "c"] ;--true: a, skip to b inclusive, c

In the above, 'b' is also skipped, and matching continues after it.

Note that this also works for multi-character strings in rules, as in:

parse "axxxxxbeec" ["a" thru "bee" "c"] ;-- skips to bee, matches c

and from the start, here using a sub-rule:

animal: ["black " "cat"]
parse "whatever---black cat" [thru animal] ;-- true

Alongside 'thru' there is the similar 'to', which does not include the item
which ends the skip. In the following example, the first 'b' is detected
but is not part of the skip. The second 'b' in the rule will match it:

parse "axxxxxb" ["a" to "b" "b" ] ;-- up to b, not including it.
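Combining 'to', 'skip' and 'copy' gives a simple way to pull fields out of
delimited text. This is only a sketch: the "key=value;" format and the
names 'name' and 'val' are invented for the example.

```red
parse "key=value;" [
    copy name to "="  skip     ;-- text before "=", then step over "="
    copy val  to ";"  skip     ;-- text before ";", then step over ";"
]
print [name "/" val]           ;-- key / value
```

Because 'to' stops in front of the delimiter, each 'skip' is needed to
move past it before the next field is copied.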

Here is an example which copies the title of a web page:

html-code: "<html> <title>My Great Page</title> Contents here... </html>"
parse html-code [thru <title> copy the-title to </title>]
print ["Title is: " the-title] ;-- prints: My Great Page

We used the tag type in the rule, but could have used e.g. "<title>".
Note the use of 'to'. If we used 'thru', then the result is:

My Great Page</title>


2.3. Using Variable:


We can create a variable, and set its value. This is a Parse feature.
Here are some examples. First we look at getting a position in the series:

parse "ABCDEFG" [thru "D" place: (print place) ] ;-- EFG printed

We created the 'place:' variable. Here, we skip to "D", including it.
Then, 'place' becomes a reference to the current input position. Here,
this is from "E" to the end.

Here is an example which finds the positions of items between START and
STOP. It uses the index? series function to get a numeric position
from the reference to the series.

parse [A B START C D E STOP F G] [
thru 'START pos1: to 'STOP pos2:
(print [index? pos1 "-" (index? pos2) - 1 ]) ;-- prints 4 - 6
]

If we display the two positions as references without using index?, we
would get:

C D E STOP F G - STOP F G
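Since both words refer to positions in the same series, they can also be
used after Parse has finished, for example to copy the items between the
two markers. A minimal sketch, reusing the input above (the added 'to end'
simply lets Parse run to the end of the input):

```red
parse [A B START C D E STOP F G] [
    thru 'START pos1: to 'STOP pos2: to end
]
probe copy/part pos1 pos2   ;-- [C D E]
```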

Here is another example of finding positions, this time in HTML text. We
are interested in <h1> items.

page: {<html> <title> My Great Page</title>
<h1> Big Heading A</h1><p>Stuff in A </p>
<h1> Big Heading B</h1><p>Stuff in B </p>
</html>
}

;-- positions of text in an h1
parse page [ any [thru <h1> h1-at: (print ["h1 at: " index? h1-at])]]
;-- position of the < in <h1>
parse page [ any [to "<h1" h1-at: thru ">" (print ["h1 at: " index? h1-at])]]

The first example uses an unquoted tag in the rule, and 'thru' includes the
whole tag.

The second example uses a quoted string to find "<h1", notes the
position, then moves to the closing ">".

Note that if we try to do the second example with 'to', as in:

any [to <h1> ...]

then Parse gets stuck, repeatedly finding the same <h1>.


2.4. Using :Variable


Here we look at modifying the input series. This example finds the start
and end of the page's title text, and uses these references to modify the
input series:

parse page [thru <title> begin: to </title> ending:
(change/part begin "A Better Title" ending)
]
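The ':word' form itself, written inside a rule, moves the input position
to the place stored in the word. Here is a minimal sketch; the name 'mark'
and the input string are my own.

```red
;-- mark: remembers the start, "abc" advances past three characters,
;-- :mark jumps back to the remembered position, so the whole string
;-- can then be matched again from the beginning
print parse "abcdef" [mark: "abc" :mark "abcdef"]   ;-- true
```

This can be useful after Red code in parentheses has changed the input,
when we want matching to continue from a known position.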



2.5. Using a Variable


We have looked at the use of ':word' and 'word:'. If we use a word
without a colon, its value is looked up and used, as in:

a-word: 3
print parse "xxxa" [a-word skip "a"] ;-- true

We could use this approach for counts, match-values etc.
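As a sketch of that idea, here both a count and a match-value are held in
ordinary words. The names 'n' and 'sep' and the inputs are invented for the
example, which relies on the same lookup behaviour as the code above.

```red
n: 2
sep: ","

print parse "ab,cd" [n skip sep n skip]   ;-- true
print parse "ab;cd" [n skip sep n skip]   ;-- false - separator differs
```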


2.5. Remove, Insert, Change


These Parse keywords modify the input series. They can be more convenient
than using Red code in parentheses.

We can remove the matched input, as in:

inp: "XXX-------"
parse inp [any "X" remove any "-"]
print inp ;-- prints XXX

We can insert an item (e.g. a string, a block) at current input position.
Scanning continues past the insertion.

inp: "XXX-------"
parse inp [any "X" insert "ABC" any "-"]
print inp ;-- prints XXXABC-------

We can change the matched input to a given value:

inp: [12 "a string" "b string" 22 44] ;-- unordered strings, integers
parse inp [some[string! | change integer! "NUMBER" ]]
print inp
print ""

The output is:

NUMBER a string b string NUMBER NUMBER


2.6. Collect And Keep


Parse has its own 'collect' and 'keep' keywords, similar to the functions in Red.
Here we parse a block of various types of item, and keep only the
integers. Keep can be used several times. Note that we match an integer
before the more general any-type!. There is also a 'pick' option for
'keep', which allows control over the storing of the selection.

data: [12 3x44 "a string" 13 a-word 14]
;-- prints 12 13 14
print parse data [collect[some[ keep integer! | any-type!]]]

The 'collect' keyword returns a block via Parse, i.e. the normal Parse
logic! result is not returned.
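Because a block is returned, we can store the result in a word and use it
later. A minimal sketch, using 'skip' to pass over the items that are not
integers; the names 'data' and 'numbers' are my own.

```red
data: [12 3x44 "a string" 13 a-word 14]

numbers: parse data [collect [some [keep integer! | skip]]]

probe numbers            ;-- [12 13 14]
print length? numbers    ;-- 3
```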



2.7. Parse into/ahead (nested blocks)


Programming languages usually have some way of delimiting nested code, such
as the curly brackets of C++ and Java. It is possible that your DSL might need
nested items. We can make use of Red's square brackets for this. Here is an
example of a nested structure. We can make the parser go into nested blocks
with 'into'.

First we look 'ahead' to see if the next item is a block.

;-- a series of: integers or nested integers
list: [22 33 [44 [55] 66] 77]

nested-rule: [ any[ item ]]
item: [copy num integer! (print num)
| ahead block! (print "got a block")
into nested-rule (print "out of block")]
parse list nested-rule

The output is:

22
33
got a block
44
got a block
55
out of block
66
out of block
77
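The same recursive approach can do some work as it descends. As a sketch,
this version adds up every integer in the list, however deeply it is
nested. It leaves out 'ahead', because 'into' simply fails when the next
item is not a block. The names 'total' and 'n' are my own.

```red
total: 0
nested-rule: [any [set n integer! (total: total + n) | into nested-rule]]

parse [22 33 [44 [55] 66] 77] nested-rule
print total   ;-- 297
```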


3.1. Debugging - parse-trace


To look into how your rules are working, you can insert printing to provide
a trace, as in:

primitive: [ num | "mem" ] ;-- original rule
;-- with prints
primitive: [num (print "got num") | "mem" (print "got mem")]

You can also use the 'parse-trace' wrapper for Parse, as in:

parse-trace some-input a-rule

This displays a detailed trace of the parse process.