User:Steamerandy/sandbox/SLIC
SLIC
[edit]SLIC is an acronym for (System of Languages for Implementing Compilers). Developed 1969-1974 it is a compiler-compiler, or metacompiler inspired by CWIC (Compiler for Writing and Implementing Compilers). It follows the transformation phases described in the early 1964 ACM paper: A GENERAL-PURPOSE TABLE-DRIVEN COMPILER by Stephen Warshall and Robert M. Shapiro in which the general concepts are explained:
If a compiler is to generate efficient object code, there are several different kinds of optimization which should take place. Each of these optimization procedures has a preferred domain: that is, some algorithms prefer to operate over the input string, others over the tree which describes the syntax of the string, others over the "macro-instructions" which are generated from the tree, and so forth. In an earlier paper, one of the present authors pointed out the necessity for employing the tree form in particular as a natural domain for optimizers which consider syntactic context and suggested that, just as Irons and others had built general-purpose table-driven parsing algorithms, one could also build a general-purpose table-driven program for getting from trees to macro-instructions. Full pdf
SLIC is the result of pursuing that kind of thinking, breaking down the translation and code generation process into five destinct program transformation steps or phases as originally proposed in Warshall and Shapiro paper. Each phase coded in a specialized domain specific language that compiles to executable machine code.
Edit 223
[edit]- Grammar specifications are coded in the SYNTAX, scannerless parser programming sub-language. Transformation operators create abstract parse trees.[NOTE 1]
- The GENERATOR sub-language performs optimizations on Abstract Parse Trees and transforms them into sequential "n-address macro", abstract machine PSEUDO instructions.[NOTE 2]
3. ISO (In-Sequence-Optimization)
- PSEUDO code to PSEUDO code transformations are coded in the ISO sub-language. The ISO phase is run before PSEUDO expansion.[NOTE 3]
4. PSEUDO expansion (Code selector)
- Machine code production is coded in the PSEUDO language. PSEUDO procedures are used to model an idealized abstract machine architecture for the language and/or a machine archatecture class.[NOTE 4]
5. MACHOP (assembler)
- MACHOPs define machine instructions and their assembly parameters. They are programed functions that output bit field sequences.[NOTE 5]
SLIC's objectives:
- 1. Make it simpler to target different processor instruction set architectures, Making a clean separation between language analysis and code generation for specific targer machines..
- 2. Improve the readability of code produnction.
Like CWIC. SLIC compiles to machine executable code. It's parser and generator languages are vary simular to CWIC.
The sub-languages can be divided into linguage and target machine specific catagories:
The SYNTAX and GENERATOR languages are target language specific. Translating input source code into sequential abstract machine PSEUDO instructions.
The PSEUDO proceedures and MACHOP programming are target machine specific.
An ISO transform may or may not be target machine and/or language specific.
When developing SLIC the main objective was to be able to produce compilers for different computer systems ideally by linking a language and machine. If we have language A and B and target machine x, y and z. we could link the language and machine parts to get six compilers. If SLIC were say the A language then we could have 3 B compilers running on each of the three machines. A total of 12 compilers.
In reality PSEUDO instruction are a mix. There are language specific PSEUDO instructions. Or language class specific PSEUDO instructions. Others may be completely language independent.
But what came out of the SLIC project was a compile-compiler that substantially reduced compiler development effort. A COBOL 68 compiler was completed in less then 6 man mounths. An equilivant COBOL compiler from Digital Equipment Corp took 8 man years. And our COBOL compiler was so equilivant we had the same record contains clause bug. DEC had extended the file contains clause:
- file_contains_clause = "FILE" "CONTAINS" integer ("RECORDS"|"RECORD"|.EMPTY) ("."|.EMPTY);
Well the shit hit the fan when the file_contains_clause is followed by a record_contains_clause and the programmer chose the .EMPTY option writing:
- FILE CONTAINS 1
- RECORD CONTAINS ...
You see the compilers took it a
- FILE CONTAINS 1 RECORD
- CONTAINS ...
and not recognizing CONTAINS reported it as an error.
Variables and objects
[edit]SLICs is in part based on the LISP 2 list procrssing language in which a dynamic memory system is included. Every datum is an object and all variables are object holders that may contain any object type. In one instance an integer, on another a float, and on yet another a list or tree. There are many object types in the SLIC system. Some common like integer, float, and string are atomic data types. Symbols, created by token rules, are cataloged dictionary objects and may have associated attribute variables. A list is an object container. A tree is a common structure used in compilers. In SLIC a tree is simply a list whose first element is a node object. Node objects are globally recognized by their name string but may have instanced attributes.
Type | Example | Discription or use |
---|---|---|
string | "abc" | |
integer | 25 3 7 | |
float | 25.4E-5 | |
list | [5,"string",25.8] | |
tree | ADD[x,MPY[ADD[z,b],5]] | |
symbol | x y sam | attribute loc:(x) undef |
gensym | %T1 %T2 %R1 %L2 | loc:(%L2) |
symbol table | internal structure | |
character | '9' '$' | single character string |
node | :ADD :SUB :FUN | ADD[x,y] |
address | section`offset | undef |
••• | ••• |
Abstract Parse Tree
[edit](3•x^2-5)/(x+4) -> DIV / \ SUB ADD / \ / \ MPY 5 x 4 / \ 3 POW / \ X 2 DIV[SUB[MPY[3,POW[x,2]],5],ADD[x,4]]
The APT (Abstract Parse Tree) is an Abstract Syntax Tree structure created using transform operators :<node name> and !<number> in the parser language. The creation method is different then described in the Abstract Syntax Tree topic but otherwise they are functionally the same. Two operational stacks are used. An Abstract Parse Tree is constructed on the parse-stack from parse-stack and node-stack entries.
The :<node name> sequance creates a node object and pushes it onto the node stack. :ADD creates an ADD node. :MPY creates an MPY node. A tokens is placed on the parse stack when a token rule is successful. The !<number> operators is used to combine the top node stack entry with the top <number> of parse stack entries. It create a list whose first element is the top node stack entry. The <number> parameter specifying the number of parse stack entries (branches) to be poped into the list. The list is then pushed onto the parse stack. APTs on the parse stack inturn become limbs of trees or are passed to functions coded in the generator language. The transform rules below perform the arithmetic expression transformation illistrated above.
expr = term $(('+':ADD|'-':SUB) term!2); term = factor $(('*':MPY|'/':DIV) factor!2); factor = ('(' expr ')'|number|id) ('^'factor:POW!2|.EMPTY);
An equivalent functional list notation:
(3•x^2-5)/(x+4) -> DIV[SUB[MPY[3,POW[x,2]],5],ADD[x,4]]
is commonly used in SLIC documentation.
A tree is a list whose first element is a node:
- [DIV,[SUB,[MPY,3,x],5],[ADD,x,4]]
Tree building is a stack automation consisting of a node stack and a parse stack. Tokens are placed on the parse stack by token rules. The : operator is used to push a node on the node stack. The !<num> pops the top node and combines it with the <num> number of parse stack entries. Poping the entries and pushing the created tree.
Abstract Parse Tree
The Abstract Parse Tree or just APT is a list whose first element is a node string. A tree of two branches x and y whose node is ADD:
ADD / \ x y
is displayed text formated as:
ADD[x,y]
or in list format:
[ADD,x,y]
The SLIC debugger can be set to display trees in either form. In the syntax language the operators : and ! are used construct APTs. The APT is built on and from the parse and node stacks. The : operator pushes the node name following it onto the node stack. The :::!::: operator takes a number following it and creates a list whose first element is poped from the node stack and combined with the top number of parse stack elements. The list is the new top element on the parse stack. The tree branches having been removed and placed in the constructed list. There are several methods of placing objects are on the parse stack. Besides the '!' number described, a sequence of parsed elements may inclosed by +[ and ]+ to form a list.
part 2
[edit]The :<node> operator is used to create and push a node object on the node stack. The !<number> operator builds a list whose first element is poped off the node stack. The top <number> of parse stack entries are poped filling out the list. The list is pushed on the parse stack. The transform rules below perform the arithmetic expression left-most tree derivations transformation illistrated above.
expr = term $(('+':ADD|'-':SUB) term!2); term = factor $(('*':MPY|'/':DIV) factor!2; factor = '(' expr ')'|number|id;
Parsing the expression:
(3•x-5)/(x+4) ->
An equivalent functional tree is is built on the parse stack:
DIV / \ SUB ADD / \ / \ MPY 5 x 4 / \ 3 x
A tree is a list whose first element is a node:
- [DIV,[SUB,[MPY,3,x],5],[ADD,x,4]]
The !<num> pops the top <node name> on the node-stack and combines it with the top <number> of parse stack entries. Poping the entries and pushing the created tree onto the parse stack.
List operators +[ and ]+ are used to create a list of parsed items.
arg_list = +[argument, (',' argument)* ]+
+[ and ]+ are grouping operators that when enclosing a parse sequence create a list of the parsed items.
MACHOP
[edit]MACHOPs are used to define assembly instructions. A .MACHOP outputs machine instruction fields. Arguments are coded as machine instruction parameters. In defining machine instruction parameters, operators such as auto increment, auto decrement and address indirection are specified by their operator presence in a peramater specification. Machine registers or memory addresses etc can be deturmined in the MACHOP sub-language. The output is in a relocatable format including polish prefix fixup strings or expresions. In PSEUDO and MACHOP functions fixup blocks are created by references to undefined (.UNDEF) values. The compiler outputs fix up blocks when all (.UNDEF) values are resolved. Unresolved fix up expression at the end of compilation (indicating undefined symbols) are reported as errors.
A machine instruction's assembly mnemonic opcode, parameters and operands are specified by a MACHOP. Machine instructions may include modifier operation flag characters. In the MACHOP language indirect addressing is indicated by an @ character. Indexing by grouping a register in parentheses.
That includes variables that are arguments assigned the instruction parameters. A bit field record like structure language is used to define binary bit field sequences. When a MACHOP is called the bit fields programed in the MACHOP language are output or planted. Again readability is important. The MACHOP also includes display formating for optional output of assembly like code in the program listing. A handy feature in debugging the compiler and for programmers using it. A main goals in developing SLIC is ease of use.
For example user mode DEC-10 instruction have a
- 9 bit opcode
- 4 bit register
- 1 bit indirect flag
- 4 bit index register
- 18 bit address. It is a 36 bit word addressable machine. The MACHOP header defines the assembly format of the instruction.
- .MACHOP #OP %AC,@I ADDR(INDEX;0)
The declaration above defines the assembly format and operands of all DEC-10 user mode instructions. The parameters:
- #OP instructions opcode.(# vectored)
- %AC accumulator or register.
- @I indirect flag.
- ADDR address or offset.
- (INDEX|0) index. (|default =0)
Defines an instruction having an opcode and two operands separated by a comma. The first is simply a register or accumulator. The address part is more complex having inirection and may be indexed. The @I parameter is indirection. The variable I is set to 1 indicating the presence of the @ indirect flag otherwise is 0. The address or offset field ADDR is a variable containing an address or offset value. INDEX is expected to be a register or numeric value. The #OP indicates a vectored opcode entry. OP is a variable assigned a corresponding value associated with an instruction's mnemonic from the mnemonic table following the instruction procedure body. The body of a .MACHOP usually starts with a .MORG (modular org) statement that aligns the plant offset.
- .MORG <alignment>:<format>:<value>;
This directive is important to understand. The first parameter <alignment> sets the starting $ plant location. $ is the plant location offset. The DEC-10 is a 36 bit word machine. So we won't to start planting an instruction on a 36 bit word boundary. The second parameter +O(18) is print format specification. The DEC-10 has an 18 bit user-mode address. We wish output in octal. The +O(18) outputs an 18 bit value in octal starting a new line indicated by the +.
.MACHOP #OP %AC,@I ADDR(INDEX)-> .MORG 36: +O(18): $/36; O(9): OP; opcode field O(4): AC; register field O(1): I: indirect address flag O(4): INDEX; index register field O(18): .IF ADDR .SYMBOL .THEN LOC_ATTR:(ADDR)/36 .ELSE ADDR; #ADD .O270; #ADD .O273; #ADDI .O271; #ADDM .O272; #AND .O404; #ANDB .O442; #ANDCA .O410; #ANDCAB .O411; • • • #POP .O262; #POPJ .O263; #PUSH .O261; #PUSHJ .O260; • • • #XPRI .O431; #XORM. .O432;
The partial opcode table above (#mnemonic value) pairs contain some of the instructions the MACHOP produces code for. The vector variable OP is assigned the corrosponding numeric. i.e. an ADDI add immediate instruction would assign OP the octal value 271. That would be output by the first 9 bit field:
- O(9): OP;
Each field is defined by a display radix and size. <radix>(size): <value expression>; <comment> The
- .MORG 36: +O(18): $/36;
statement starting the instruction production aligns the output to start on an addressable boundary of the target proceszor. In the case above a 36 bit word boundry. We are generating machine code into a bit addressable memory space. Addresses are into these bit addressable memory spaces.
.MORG 36: +O(18): $/36; | || | | | || | °plant address. | || °Field size. | |°output radix. | °start new line. °Align on a 36 bit boundry
The assembly print formating of the instruction location. 18 bit value $/36 output in octal. $ is the plant pointer. The + specifies a newline.
- Token rules recognize character sequences (symbols, literal strings and numbers) transforming them into token objects.
- <token> .. <test conditional expression>;
- The parser is programed using test functions resembling phrase structure/constituency grammar rules. Using Tree, list and node operators transform the recognized constituent into an abstract parse trees.
- A rule assigns a unique name to a constituent or token (test) function defined using a conditional expression. That function may be called by referencing it's name to recognize the constituent.
A generate function is in a way like an overloaded function in C. Only the argument type recognition is an active runtime process. A generator function is of the form generator_name followed by one or more '"unparsr => action pares i.e.
-
- <name> [<unparse1>] => <action1>;
- [<unparse2>] => <action2>;
- •
- •
- •
- [<unparse2>] => <action2>;
- <name> [<unparse1>] => <action1>;
Pseudo code is planted into named sections. Planting is the term for generating a machine code instruction. Planting them into named blocks of memory called section. SLIC instead appends the PSEUDO instruction to a sections code list. At some point the section is flushed and the PSEIFO code procedures are called to produce machine code. PSEUDO code execution proceeds through the section code list calling each PSEUDO instruction in order. PSEUDO instructions call machine operations to output machine code to the object file. The PSEUDO language is basically the generator procedural language only having a simple argument list instead of a unparsr rule. They plant code with MACHOP calls executed immediately. PSEUDO procedures are like assembly macros that expand to machine instructions. Only pseudo procedures are executable compiled code.
edit b
[edit]A symbol table contains symbols parsed from the source. A symbol table is an object and may be assigned to a variable or a symbol attribute. Tree objects are lists whose first entry is a node object. Node object are created in the syntax language using the :<node symbol>. A node is a special symbol each created node instance has it's own attributes. Lists are usually displayed inclosed in [ ... ]. A tree (recognized by it's first element being a node type) is displayed as: <node symbol>[ ... ]. i.e. ADD[x,MPY[Z,5]].
The above is a 3 element list whose third element is a 3 element list. In tree digram form:
ADD / \ X MPY / \ Z 5
edit c
[edit]CWIC was the inspiration for developing SLIC, Its syntax and generator languages are nearly identical.
The ideal compiler compiler
[edit]The goal of an ideal compiler-compiler.
These compiler writing and implementing systems approach the ideal compiler-compilers goals:
The ideal compiler-compiler takes a description of a programming language and a target instruction set architecture, and automatically generates a usable compiler from them.
From the CWIC ACM paper:
In its most general form a metacompiler is a program written for a Machine M which will accept specfication for a programming language LJ; and its equalivent in the language of Machine Mi, and produce a compiler which runs on machine M. Source programs which are input to this compiler are written in language LJ. The output of this compiler is object language which runs on Machine Mi
- The ideal compiler-compiler takes a description of a programming language and a target instruction set architecture, and automatically generates a usable compiler from them. In practice, the state of the art has yet to reach this degree of sophistication. compiler-compiler
The main goal in developing SLIC was to be able to specify the generation of executable code for any target computer instruction set. Seperating the language and target machine specification. SLIC is about as close to the ideal compiler compiler as it gets. You have the language specification and it's semantic specifications. The target computer, machine dependices, needed to be separated out. Targets machine specifications could then be coded separately and linked with the language. A clear separation of source language processing and target machine specification. So instead of producing binary code in the generator language as CWIC does SLIC produces pseudo code. A process called planting in both systems.
The result is a compiler specification in comprehensible, readable symbolic form. It achieves the separation of machine code generation from the tree crawling generator language. The basic strategy is close to the ideal compiler compiler. Translating the source code to intermediate macro like instructions that expand to the target processor's machine code.
System of languages refers to five special purpose languages following the idealized compiler organization first envisioned by Stephen Warshall and Robert M. Shapiro:
- If a compiler is to generate efficient object code there are several kinds of optimization which should take place. Each having a preferred domain: that is, some operate best on input strings, other over the tree structures which represent the syntax of the strings, others over the sequntialized instructions, and so forth. (Stephen Warshall and Robert M. Shapiro: (A GENERAL-PURPOSE TABLE-DRIVEN COMPILER, 1963)[1]
SLIC, though independently developed, follows their strategy having five phases. Each written in a specialized syntax designed for a spicific use in the translation of a high level programming language to binary maching code. It should be noted that a phase is not equilivant to pass, as in a multi-pass compiler. So although the goal of the five phases were for different reasons both goal are accomplished.[* 1]
Phase 1 Grammar analysis
[edit]program = $((declaration | .EOF .STOP)\ ERRORX["Error"] $(-';') ';"); declarations = "#" dirictive | comment | global DECLAR(*1) |(id (grammar PARSER(*1) | sequencer GENERATOR(*1) | optimizer ISO(*1) | pseudo_op PRODUCTION(*1) | emitor_op MACHOP(*1)) \ ERRORX("!Grammer error") garbol); grammar = (':' class :CLASS // character class define |".." token :TOKEN // token rule |"==" syntax :BCKTRAK // backtrack grammar rule |'=' syntax :SYNTAX // grammar rule. )';'!2 // Combine name and rule tree $(-.LS -"/*" .any); comment = "//" $(-.LS .any) | "/*" $(-"*/" .any) "*/"
The grammar analysis is coded in the syntax language. It is not a grammar from which a parser is generated, but a stack-oriented string processing functional programming top down analytical reductive phrase structured grammar specification language in which a parser is coded. One might say this distinction is a moot point. But in programming the parser is vary important.
The grammar rules fall in the functional programming paradigm. Their implicit input a character sequence from the source stream. They match tokens or phrases returning success or failure. On success the input stream is advanced over matched characters. On failure the parse state is unchanged.
Syntax and token rules are compiled into boolean functions returning success or failure. Their input is the input character stream. Their output is parsed constructs or objects on the parse stack. A grammar rule returns success(true) or failure(false) depending on its matching the language construct pattern. A rule is a formula that specifics a pattern. It may include tree or list transform operators and generator function calls. There are three rule types. Character class, Token and syntax rules. Rules are of the form rule_name followed by the rule type equilivance operator followed by it's structure specification and terminatedo by a semicolon character. Generator calls and transform operators may be used in the rules.
- <name> <equilivance> <pattern> ;
Character class rules
[edit]bin: '0'|'1'; oct: bin|'2'|'3'|'4'|'5'|'6'|'7'; dgt: oct|'8'|'9'; hex: dgt|'A'|'B'|'C'|'D'|'E'|'F' |'a'|'b'|'c'|'d'|'e'|'f'; upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G' |'H'|'I'|'J'|'K'|'L'|'M'|'N' |'O'|'P'|'Q'|'R'|'S'|'T'|'U' |'V'|'W'|'X'|'Y'|'Z' lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g' |'h'|'i'|'j'|'k'|'l'|'m'|'n' |'o'|'p'|'q'|'r'|'s'|'t'|'u' |'v'|'w'|'x'|'y'|'z'; let: upr|lwr; alphanum: let | dgt; symchr: alphanum | '_'; skip_class: 0h08|0h09|0h0A|0h0B |0h0C|0h0D|' ';
A character class, defined by the ":" equilivance operaror following its name, is used to define or name a set of characters. A named character group literal used in syntax and token rules that work as a literal matching any character of their class. Generally they are used in defining tokens. A character class is defined by a list of characters, each separated by the "|" alternative operator. Characters are specified by their quoted glyph or numeric code.
bin: '0'|'1'; // quoted glyph form
A bin can be the character 0 or the character 1. Numeric code values for the characters may be used in place of 0 and 1 the bin class expressing the identical same set of quoted glyphs using numeric ASCII codes:
bin: 0H30|0H31; // hex numeric codes
Using bin in a syntax or token rule will match any characteor in the bin character class. Only character constant's or literals and previously defined character classes may be used in character definations. A class includes all characters in the named class as illistrated by the folllowing commonly used character classes:
Character class rules normally do not generate executable code. In implementation they are not called but generate in line code. They are used in token rules to match characters of a token. Characters thus matched are appended to the token string. Used in syntax rules they will match a character belonging to the class, like a literal string match in a syntax rule the character is not normally kept. The action modifiers + - and ? may be applied to a character class in a syntax rule. The built-in .any character match operator will match any character code only failing if there is no character .i.e at the end of input. The .any operator is used and performs like a character class though its implementation is quite different in only checking file status and only failing at the end of input.
Character classes are implemented using a class table that is indexed by a character's numeric code. The indexed entry holds class member chip bits. A single indexed test deturmins class membership.
bin | 0b00000010 | |
oct | 0b00000110 | oct includes bin |
dgt | 0b00001110 | includes bin and oct |
hex | 0b00011110 | includes bin oct and digit |
upr | 0b00100000 | Pure class |
lwr | 0b01000000 | Pure class |
let | 0b01100000 | existing classes only. |
alphanum | 0b01101110 | existing classes only. |
symchr | 0b11101110 | added '_' |
skip_class | 0b00000001 | Predefined class |
It may be worth noting that implementing the above character classes, using a class map bit table, only requires 8 bits. Each class is a bit mask. A bit is only required for a class that has characters. One made up wholly of other classes, let and alphanum, require no allocation of a class bit.
Character class rules combine in the generation of a character class table. The class table is built such that an entry indexed by a character's numeric value contains its class memberships. The letter 'A' belongs to two classes upr and hex. Its table entry would be 0b01010000
A character's numeric value is used to index into the class table. When a table entry masked with a class mask is nonzero the character is a member. Fast test of class membership. Most modern processors are able to test character class membership with a single instruction. With the character in the eax register of an IA86 processor a single test instruction is able to test class membership with a test instruction. A conditional JE or JNE instruction follows.
test _class_tbl[eax],class_name jn is_a_member
Token rules
[edit]A token rule signified by the token defining operaror ".." following its name, is used to recognize and create objects of character sequences that make up the words, numbers, symbols and strings, of the language.
In many compilers tokenizing or lexical analysis is a separate pass. In matacmpilers tokenization is part the grammar analysis phase. Recognizing a token is directed by syntax rules. Token rules are not the normal regular expressions commonly used today. The first major difference is they are easer for a layperson to read. A token rule has a name that should be descriptive of its function. An identifier rule for example could be named id. Or identifier if we weren't so lazy.
id .. let $symchr;
The '$' operator (a Kleene star equivalent) is the sequence operator. Literally meaning zero or more of the following. The above rule could be written as:
id .. let $(let|dgt|'_');
The compiler could be programed to recognized that (let|dgt|'_') is equivalent to symchr in function. However it does not do so. Were (let|dgt|'_') generates a test for each alternative symchr results in a single test. This is a compiler writing system. The programmer is expected to have of code generated and use appropriate coding practices.
Token rules skip leading skip_class characters in the character stream until a character match is found. The skip_class character skipping algorithm allows for recognizing a leading skip_class character when specified in the rule. With only one exception token rules may not call functions. The one exception is to a generator function at the conclusion of the token recognition. The generator call may only appear as the last component of a rule or an outermost alternative. Functions may be called to alter default token string processing. These functions include string to numeric token conversions. Default token processing is as a symbol table entry. Calling makestr[] creates a string token. Embedded action code may also be used to test symbol attributes. Binary, octal, decal, hex, and floating point conversion functions are available. Token rules return success or failure. On success the token object is pushed onto the parse stack. On failure the input stream is reset to it's state on entry.
Many compilers have a seperate lexical pass. That is not the case here. Tokens are parsed on the fly. Token rules create atomic objects or words, strings and numbers of the target language. Token rules may only use static matching. That is they do not make calls to other rules. A conversion function may be called but only as the last operation before terminating. A matched, created, token is pushed on the parse stack(*stack).
Some token rule examples, part of the metasyntax language
integer .. ("0b"|"0B") bin $bin MAKEBIN() // binary integer |("0o"|"0O") oct $oct MAKEOCT() // octal integer |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX() // hex integer | dgt $dgt MAKEINT(); // decimal integer; id .. let $symchr; string .. """" $(-"""" .any | """""","""") """" makestr(); char .. ''' .any ''' makestr();
Token operators are described in table below
With lixical processing done on the fly. The skipclass character class defines white space and non-displaying characters that are skiped when lookilng for a token. Skipclass skipping is automatic until a character is matched by the token rule. No white space is allowed between any of the successive character matches unless specifically expressed in the rule. Normally a token rule when successful would create a symbol, dictionary, object. A dictionary token would then result. Symbols are automatically cataloged in the dictionary and the symbol object placed on the parse stack. Every token matched by a token rule will automatically be pushed on the parse stack. Functions that intercede the normal symbol processing can be used to create other object types.
Below is the metacompilers integer token rule. The leading radix code strings preceeding the numeric string determine valid characters and the interceding conversion function creates an integer object. Alternatives parse the various radix strings. A binary integer starts with 0b or 0B followed by one or more bin characters. 0b100101 is a valid sequence defined by the first alternative. The character class bin was defined in the character class section above.
. integer .. ("0b"|"0B") bin $bin MAKEBIN() // binary integer |("0o"|"0O") oct $oct MAKEOCT() // octal integer |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX() // hex integer |dgt $dgt MAKEINT(); // decimal integer
The IA86 asembly code is is shown below.
The integer rule above has alternatives separated by the '|' characters. ("0x"|"0X"|"0h"|"0H") defines the following character strings identifying it as a hexadecimal integer. Ox or 0X or 0h or 0H preceed a hex, base 16, number. The use of MAKEBIN(), MAKEOCT(), MAKEHEX(), and MAKEINT() intercede default dictionary entry, converting the sequence of characters matched into an integer object, and pushing the object onto the parse stack. Conversion function may only be used at the end, as the last entry, in an outermost alternative of a token rule. The integer rule illustrates how the programmer controls parsing and placement of conversion functions. A decimal number having no radix prefix is the last alternative . Alternatives are tried in order resulting in the first matched satisfying the rule. The programmer is responsible for avoiding ambiguities. Strings, for example "0b", matched in token rules are not kept as part of a the token. A +<string> may used when it is desired that a string be kept as part of a token. A string may be inserted into a token using the insert ,<string> operator.
The intercede functions are generator functions that return success or failure. If it were required to have a radix specifier at the end of an integer it would be a simple matter to parse all possible numeric characters and have the conversion function check for valid radix characters.
Token backtracking is confined to the rule automatically backtracking to their start point and returning failure.
A higher level grammar rule could be used. With a programmable parser we are not confined by the parser.
binary .. bin $bin ("b"|"B") MAKEBIN() // binary integer octal .. oct $oct ("o"|"O") MAKEOCT() // octal integer hexnum .. hex $hex ("x"|"X"|"h"|"H") MAKEHEX() // hex integer decmal .. dgt $dgt MAKEINT(); // decimal integer integer = binary | octal | hexnum | decmal;
In the above we have used a syntax rule with alternatives having the radix character appearing on the end of the number string. Token rules automatically backtracking to their start point are a simple reseting of the input stream to a non-stacked saved point. A token rule does not long fail. The above set of rules is not significantly different then the first integer rule.
Token operators
[edit]There are two string literal forms. The single character string bounded by single quote marks '<character>' and the multi-charater string bounded by double quote marks "<string>". Either may be used as a string. Where a single character is specificly specified the single character string is used. Dot word operators may be in upper or lower case. .any or .ANY are equilivant.
OPERATOR | OPERAND | FUNCTION |
.any | none | Matches any character advancing input stream. Appends matched character to token. Fails on end of input. |
.LNBRK
.LB |
none | Line break anker. Used to match start or end of a line. Used as first test in a rule it will match the start of a line, skiping skip_class characters. Used otherwise it will match skip_class character to a line break. |
quoted
string |
"<string>"
'<char>' |
The literal character string is matched aginst the input stream. If successful the input is advanced over matched characters. Quoted characters '?' and strings "xyz" are literal character strings. |
+ | "<string>"
'<char>' |
String is matched aginst input stream. If successful it is appended to the token and input advanced over matched characters. |
- | "<string>"
'<char>' |
Fails if string matches input stream. Input is not advanced in any case. |
~ | '<char>' | Matches any character not its argument. Any other character is appended to the token. |
? | "<string>"
'<char>' |
Succeeds if string matches input stream. Fails otherwise. Input is not advanced in any case. |
, | "<string>"
'<char>' |
Comma appends string to token. ,'"' appends a " character to the token. |
$ | <test> | Zero or more loop operator. Loop following <test> construct until failure. Always successful |
-- | none | --) describes a to be optional. |
.EMPTY | none | Always successful. Used as the last alternative makes it valid that none of the alternatives being matched is valid. The -- operator replaced this operator. |
Note. These operatores may operate differently in syntax rules.
Syntax
[edit]Syntax rules define language structure.
Phase 2 Code generation
[edit]Code generators are again boolean functions returning success or failure. They are a combination of TREE-META's tree pattern matching unparse rules and block structure LISP 2 like actions. A grammar rule initiates code generation by calling a generator function passing it an abstract parse tree. Generated code, macro like pseudo instructins, are planted into named sections. The terms plant or planted are CWIC terms for generating binary code into named memory blocks. SLIC is not generating binary code in this phase. Instead pseudo (macro like, executable) instructions are appended to a named section's instruction list. Sections are used to separate processor instructions and data. A flush statement naming the section is used to initiate the pseudo instruction expansion. Upon flushing a section the pseudo code may optionally be transformed by the In Sequence Optimize before expansion to machine code.
Phase 3 In Sequence Optimizer
[edit]The optionally executed ISO language transforms operate on sequential pseudo instruction patterns. They can rearrange code sequences and/or transform to different instructions. ISO transforms are like search and replace of test editors. Scanning a pseudo list matching pseudo sequences. Some optimizations are better performed in the generator language.
Phase 4 Pseudo expansion
[edit]Pseudo instructions are queued procesure calls that are appended to a section list. The execution defered until the section is flushed. They have a set of atomic arguments. The procedural body is the same LISP 2 generator language. They are a single procedure taking a list of arguments and generate binary code, calling functions resembling assembly instruction. They can also plant pseudo code as well. The have the same procedural power of a generator action. Data is generated in the same way as machine instructions. Constants, temporary values, and variables are created in sections using pseudo instructions.
Phase 5 Machine code output
[edit]With the goal of self documenting code. A way of defining assembly instruction and conversion to numerical machine code was needed. Sense computers came in many forms having different size addressable units. Six bit character machines. IBM 1401 and Honeywell H200. The Honeywell 800 and Borrows ALGOL B5500 and 6000 series machines were 48 bits. The IBM 7094 a 36 and DEC-System-10 36 bit word machines. PDP-8 having 12 bit words. The PDP-11 a 16 bit word 8 bit byte addressable. IBM 360 having 8 bit addressable bytes. Micro computers many having 8 bit byte addressable units. CWIC generated code for 8 bit machines. The machine operation language was conceived to define machine code. The concept is a procedure the generates binary. Machine instruction are made up of fields. Opcode, register, index, memory address etc. An assembler translates symbolic machine instructions into numeric code. Instructions are stored in memory. Memory of various size addressable units. To generate instructions SLIC MACHOP language translates.
Binary machine code generation is defined in the MACHOP language. The MACHOP language is used to define assemble instruction. Including data directive allocation and initial values. A MACHOP may define a computer executable instruction or an allocation of a constant or variable. MACHOPs define the generation of bit fields in bit addressable memory. Memory unit alignment is acomplished with a morg, modulo org, that aligns the plant location to a modulo boundary.
SYNTAX language
[edit]The syntax metalanguage is used in programming the parser. The goal of the SYNTAX program is analyzing an input character steam and transform it into an Abstract Parse Tree equivalent of the parsed input.
The syntax language more properly called a grammar language defines language constructs made up of symbols. Symbols of a language are character strings. They are defined using token rules or matched directly with a string match.
The following SLIC grammar for Niklaus Wirth's PL/0
program = block '.';
block = ("const" ident '=' number $(',' ident '=' number) ';' /.EMPTY)
("var" ident $(',' ident) ';' /.EMPTY)
( "procedure" ident ';' block ';' /.EMPTY)
statement;
statement = ( ident ":=" expression
| "call" ident
| "?" ident
| "!" expression
| "begin"
statement $(';'
statement)
"end"
| "if" condition "then" statement
| "while" condition "do" statement
|.EMPTY);
condition = "odd" expression
| expression ("="|"#"|"<"|"<="|">"|">=") expression;
expression = ('+'|'-'|.EMPTY) term $(("+"|"-") term);
term = factor $(("*"|"/") factor);
factor = ident | number | '(' expression ')';
GENERATOR
[edit]Generstor functions are used to produce sequentialized code. (CWIC GENERATOR language except changed to produce PSEUDO instruction instead of direct binary code output). TREE META like unparse rules combined with lisp 2 based procedural list processing language. PSEUDO code lists are generated by generator functions. The analysis and sequentialized code production is processor independent PSEUDO code.
PSEUDO
[edit]Pseudo instriction are used as an abstraction, defering actual binary machine code generation to the pseudo expansion phase.
The generator functions process abdtract parse trees emiting pseudo instructions. Pseudo instructions are a0n idealized target machine's operations usually designed for the implementation of the input language. A pseudo instruction is an executable function having a parameters list whose code body language is the dame LISP 2 dialect in the generator language. A pseudo is a lot like an assembly macro
allowed simplification of different processors machine code instruction sets and machine independent sequential optimizations. PSEUDO instruction are similar to macro definitions in assembly except for being executable code implemented in a LISP 2 based procedural language that calls MACHOP instruction defined in the MACHOP sub-language.
- ISO sub-language: (In Sequence Optimization) Used for target processor independent sequential optimization.
- MACHOP sub-language: used to define a target processors instructions, assembly opcode mnemonic and operands translation into binary bit fields, assembled into bit addressable memory sections.
SLIC was used to write a COBOL cross compiler running on a DEC-10 and producing code for a TI-990 (8 bit) BYTE addressable 16 bit word minicomputer. SLIC at the time was partially(about 80%) written in itself producing code for the DEC-10 a 36 bit word machine. The other 20% was MACRO10 assembly code.
Abstract Parse Tree Xxx
[edit](3•x-5)/(x+4) -> DIV / \ SUB ADD / \ / \ MPY 5 x 4 / \ 3 x DIV[SUB[MPY[3,x],5],ADD[x,4]]
The APT (Abstract Parse Tree) is an Abstract Syntax Tree structure created using transform operators :<node name> and !<number> in the parser language. The creation method is different then described in the Abstract Syntax Tree topic but otherwise they are functionally the same. Two operational stacks are used. An Abstract Parse Tree is constructed on the parse-stack from parse-stack and node-stack entries.
The :<node name> sequance creates a node object and pushes it onto the node stack. :ADD creates an ADD node. :MPY creates an MPY node. Tokens are placed on the parse stack by token rules. The !<number> operators is used to combine the top node stack entry with <number> of parse stack entries. It create a list whose first element is the top node stack entry. The <<number> parameter specifying the number of parse stack entries (branches) to be poped into the list. The list is then pushed onto the parse stack. APTs on the parse stack inturn become limbs of trees or are passed to functions coded in the generator language. The transform rules below perform the arithmetic expression transformation illistrated above.
expr = term $(('+':ADD|'-':SUB) term!2); term = factor $(('*':MPY|'/':DIV) factor!2); factor = '(' expr ')'|number|id;
An equivalent functional list notation:
(3•x-5)/(x+4) -> DIV[SUB[MPY[3,x],5],ADD[x,4]]
is commonly used in SLIC documentation.
A tree is a list whose first element is a node:
- [DIV,[SUB,[MPY,3,x],5],[ADD,x,4]]
Tree building is a stack automation consisting of a node stack and a parse stack. Tokens are placed on the parse stack by token rules. The : operator is used to push a node on the node stack. The !<num> pops the top node and combines it with the <num> number of parse stack entries. Poping the entries and pushing the created tree.
Abstract Parse Tree
The Abstract Parse Tree or just APT is a list whose first element is a node string. A tree of two branches x and y whose node is ADD:
ADD / \ x y
is displayed text formated as:
ADD[x,y]
or in list format:
[ADD,x,y]
The SLIC debugger can be set to display trees in either form. In the syntax language the operators : and ! are used construct APTs. The APT is built on and from the parse and node stacks. The : operator pushes the node name following it onto the node stack. The :::!::: operator takes a number following it and creates a list whose first element is poped from the node stack and combined with the top number of parse stack elements. The list is the new top element on the parse stack. The tree branches having been removed and placed in the constructed list. There are several methods of placing objects are on the parse stack. Besides the '!' number described, the next most common is a token rule. A sequence of parsed elements may inclosed by +[ and ] to form a list.
part 2
[edit]The :<node> operator is used to create and push a node object on the node stack. The !<number> operator builds a list whose first element is poped off the node stack. The top <number> of parse stack entries are poped filling out the list. The list is pushed on the parse stack. The transform rules below perform the arithmetic expression left-most tree derivations transformation illistrated above.
expr = term -$(('+':ADD|'-':SUB) term!2); term = factor $(('*':MPY|'/':DIV) factor!2; factor = '(' expr ')'|number|id;
Parsing the expression:
(3•x-5)/(x+4) ->
An equivalent functional tree is is built on the parse stack:
DIV / \ SUB ADD / \ / \ MPY 5 x 4 / \ 3 x
A tree is a list whose first element is a node:
- [DIV,[SUB,[MPY,3,x],5],[ADD,x,4]]
The !<num> pops the top <node name> on the node-stack and combines it with the top <number> of parse stack entries. Poping the entries and pushing the created tree onto the parse stack.
List operators +[ and ]+ are used to create a list of parsed items.
arg_list = +[argument, (',' argument)* ]+
+[ and ]+ are grouping operators that when enclosing a parse sequence create a list of the parsed items.
Source stream
[edit]The source input is a character stream. It is processed one character at a time. Input may be from files or a device. A keyboard for example. The default is from stdin. The command line is processed by SLIC which then opens the input and output files. The input stream is associated with one or more input files. The I/O routines buffer input only releasing buffers previous to the oldest backtrack point.
CWIC
[edit]Compiler for Writing and Implementing Compilers was developed at SDC (System Development Corporation) in the late 1960'S early 70's.[2]
Planting code
[edit]Plant or planting is the term oragionally used by CWIC for generating machine code in generator actions. The < and > group a plant sequence. Sections are a part of planting that provides named spaces. One could be for data. Another instructions. CWIC and SLIC both plant code into named sections and flush them. But the actual functional operations are quite different. Flushing writes code to the output file. A section is created with a section declaration.
SLIC sections
[edit]Plant or planting is appending pseudo instructions to a sections instruction list. The pseudo instructions are then executed on flushing the section. The SLIC plant construct simple held pseudo instructions:
< mov x,%T1 add 5,%T1 mpy %T1,y mov %T1,x >
mov, mpy,and add are pseudo procedures that upon flushing the section pseudo list will be executed to produce binary code. Once executed the list is cleared. On flushing In Sequence Optimizers may be run on the pseudo list. Usually comand line switches are used to set optimizations. Optionally sections may be executed living the pseudo list intact. A feature useful for inturpiters. A pseudo list may be assigned to a symbol attribute.
Compiler Organization
[edit]Syntax phase. Reads source code from the input stream transforming it into abstract parse trees that are passed to generator functions. A generator is a named list of unparse transforms. The unparse transforms generate pseudo code sequences. Several sequences can be generated in parallel to different named sections. The term used for emitting a pseudo instruction is planting. Sense we have named code sections the separating of code,constants and data is simple. We plant data pseudo instructions into a data section and executablep pseudo instructions into named code sections.
SLIC objects
[edit]All datum are dynamic objects. An object's pointer points at its value. Attributes are negative offsets from its pointer. Attributes are type dependent. A string has no attributes for example. Nodes and symbols have fixed and assignable attributes. Symbols have leaf and head pointer, def flag, type etc fixed attributed. Other attributes may be attached in assiciative lists A pseudo object would contain an array of parameters. It attributes would be it procedure entry and previous and next pointers. The SLIC dynamic memory manager is object type aware.
○ -n | object attributes | |
○ - | ••• | |
○ -2 | object attributes | |
○ -1 | type (-n, n element list) | positive values are atomic data types. i.e. strings, symbol strings, node name, integer, float, native float, fixed point, decimal, decimal fixed point, ••• |
○ object | object begins | object pointer points here. Negative offsets are attributes. positive are object data. |
○ + | ••• | |
○ +n | object |
String Processing Languages
[edit]In computer programming the unqualified term string usually refers to a sequence of characters. Computer input and output is commonly in the form of strings. Many common programming language have support for character data. Specialized string languages have been developed and used. Some common string processing languages:
Comit | 1957-58 |
SNOBOL | 1962-63 |
SNOBOL 3 | 1965 |
SNOBOL 4 | 1967 |
TRAC | 1964 |
AMBIT | 1964 |
Icon | 1990 |
Junk Pile
[edit](System of Languages for Implementing Compilers)
Token rules recognize and create objects of character sequences that make up the words, numbers, symbols and strings, of the language. Syntax rules recognize language structures transforming them to tree or list structure
SLIC's implementation of lists is as a dynamic array. Only having a header overhead. The header containing its negated size. i.e. -5 indicating a 5 element list. A positive value indicated a datum type. An object structure consisted of a header and its data. An object pointer is to its data. It's header a negative offset. A list having its negative length as its data type. Positive values indicating atomic objects. Positive values are of a fixed size and negative value the length of a list. The dynamic memory management is a vary important part. Efficiency of memory is vary important to compilation speed. For example a binary tree is a common structure. (A list consisting of a node and two branchs.) Maintaining some number of free blocks of 3 element list size speeds up compilation. Allocating a block contains of some multiple number of 3 element lists
xx | yy | zz |
xx0 | yy0 | zz0 |
SLIC was developed at Cerritos Collage, Norwalk California, in the period from 1969 to 1974.
Links of use
[edit]programming language implementation#Abstract machines
META II, TREE-META and CWIC.
Except SLIC's generator language produces PSEUDO instructions. PSEUDO instructions play the part of macro instructions as described above. CWIC called code generation planting. Planting in CWIC put 8 bit byte sequences into named memory block sections. One might have a code section and a data section for example. Plant constructs are inclosed in angle brackets. IBM 360 Add Register and Subtract Register instructions:
- <AR+(x*16)+y;>
- <SR+(x*16)+y;>
Generate 16 bit 2 byte instructions as AR and SR are defined as such. The memory blocks were written to the output file using a flush <section name> statement.
In SLIC "planting" appends PSEUDO instruction to a section's code list. Upon flushing, the list is executed. PSEUDOs are procedures that call MACHOPs to output machine code.
A GENERAL-PURPOSE TABLE-DRIVEN COMPILER. Stephen Warshall and Robert M. Shapiro
syntax program goal
[edit]The main goal of which is to match the complete program with the starting syntax rule. The start rule drives the parser describing a complete program. The name of a syntax rule appears at the start of the rule followed by the = symbol. The right-hand side of a rule indicates which entities are to be recognised to achieve this goal. The recogniser achieves its main goal by looking for these smaller entities, from left to right, as individual subgoals. These subgoals are themselves literals or other named rules. The process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic tokens are recognised. It starts with a top rule reducing complex language structures into simpler and simpler structures down to atomic elements.
cc
[edit]SLIC followed the tradition set by CWIC and other compilers of the day. It is a block structured sub-language using .BEGIN and .END ala ALGOL. It was common to use a non-identifier character to start a <rserved word>. cc breaks away from that using C braces {} Some words are reserved like if, else, then, etc. CWIC and previous meta compilers also grouped the specific rule types into divisions. CWIC required you declared syntax ruled to follow using
.SYNTAX
and generator using:
.Generator
With cc that is going a way. I am also thinking about using more of an overloaded function form defining generators. I prefer to have all the transform of a generator together. But during development it would be nice to have the unparse rule and the tree producing grammar rule close together in the source.
Notes
[edit]General notes to self. Personal info not to be used in final version.
- ^ The ISO phase was an after thought. Added after reading the Warshall and Shapiro paper.
References
[edit]- ^ Stephen Warshall; Robert M. Shapiro (April 23, 1964). A general-purpose table-driven compiler AFIPS '64 (Spring) Proceedings of the , spring joint computer conference. pp. 59–65. doi:10.1145/1464122.1464128. ISBN 978-1-4503-7890-1.
- ^ Book, Erwin; Dewey Val Schorre; Steven J. Sherman (June 1970). "The CWIC/36O system, a compiler for writing and implementing compilers". ACM SIGPLAN Notices. 5 (6): 11–29. doi:10.1145/954344.954345.
Cite error: There are <ref group=NOTE>
tags on this page, but the references will not show without a {{reflist|group=NOTE}}
template (see the help page).