The Picky programming language 6/9/11 Francisco J Ballesteros Laboratorio de Sistemas Universidad Rey Juan Carlos

ABSTRACT Picky is a programming language designed for use in a first level, introductory, programming course. The language is small and simple, and is strict regarding what is a legal program. This document describes the language.

1. Motivation

Ada could be a good language for teaching, but it is quite verbose and utterly complex. This makes things hard for students in introductory courses, because there are many different constructs to master. Picking a subset is not doable in practice, because many of the features left out still show up even for modest subsets. Type safety is a must, but automatic features (like automatic dereferencing of pointers) make it unclear for students what the code actually does. Also, control structures requiring exit when constructs are easily misused.

File handling in Ada is clumsy, to say the least. For example, calling End_Of_File may block a program that is reading from a terminal, and students will not know why. Furthermore, we teach that functions should not have side effects, but many file I/O tools are functions.

Low level languages, like C, are not suitable at all. Type safety is a must, and structured data including strong typing and range checks are good to have when learning how to program for the first time.

Scripting languages do not enforce good practice, and have undesirable features in many cases. Examples are white space as part of the syntax (e.g., tabulators) and automatic declaration of variables.

Object oriented languages are too complex for use as a first language. They may be popular, but they are not clean and look like magic to most students.

Pascal is a good first language. However, its control syntax is verbose. Also, the language syntax is more complex than needed. For example, the use of semicolons as separators instead of terminators for statements is a problem for students. They end up guessing when to add a semicolon and when not to add one.

We wanted a language as simple as Pascal, with terse syntax (like C), and a realistic handling of file I/O. File I/O is important not just to perform I/O, but also to make students learn how to use control structures to guide data consumption without violating the rules imposed by the file abstraction.

As a result, we designed a new language, called Picky. The language compiles to byte-code for an abstract machine called PAM. An interpreter for PAM code is supplied along with the compiler. This isolates students from portability issues that would arise otherwise.

When a kid learns how to ride a bicycle, it is convenient to use side-wheels for a while. Only after the bicycle is under control does a new one (without side-wheels, and perhaps with an engine) become more convenient. In the same way, Picky is highly restrictive regarding what can and cannot be done in a program. It has side-wheels attached. Both the compiler and the run time include extra checks, and waste memory and time, to provide additional safety features (e.g., more informative diagnostics regarding accidental use of dangling pointers).

2. The language

2.1. Picky programs

Picky has control structures reminiscent of C and data declarations in the style of Pascal. A source program is made of a single file. This is a hello world:

     1  /*
     2   * Hello world
     3   */
     4
     5  program Hello;
     6
     7  procedure main()
     8  {
     9      writeln("hello, world");
    10  }

Comment syntax is taken from C. A program is introduced by a program clause (line 5) that assigns an identifier to the program. A program may have constant and type definitions, variable declarations, procedure definitions, and function definitions. A procedure named main must be included, like in C. The program starts executing its body and terminates when returning from it.

All declarations and statements are terminated by a semicolon, but note that procedure and function definitions are not terminated by a semicolon. Constants, types, procedures, and functions may not be declared within the scope of a procedure or function. That is, subprograms may not be nested, and constants and types must be declared in the global scope.

The language is case-sensitive. Thus, main, Main, and MAIN are different identifiers. An identifier must start with an alpha rune followed by zero or more alphanumeric runes.

The following names are reserved and correspond to keywords, pre-defined variables, types, procedures, functions, and constants. All other names are available for new identifiers.

    acos        do          for         log10       pred        succ
    and         else        fpeek       Maxchar     procedure   switch
    array       Eof         fread       Maxint      program     Tab
    asin        Eol         freadeol    Minchar     read        tan
    atan        Esc         freadln     Minint      readeol     True
    bool        exp         frewind     new         readln      types
    case        False       function    nil         record      vars
    char        fatal       fwrite      not         ref         while
    close       feof        fwriteeol   Nul         return      write
    consts      feol        fwriteln    of          sin         writeeol
    cos         fflush      if          open        sqrt        writeln
    data        file        int         or          stack
    default     float       len         peek        stdin
    dispose     flush       log         pow         stdout

A program starts with the program clause and must include a procedure with no parameters named main, as shown. A program may also include one or more constant declaration blocks, one or more type declaration blocks, one or more variable declaration blocks, and procedure and function definitions. The scope of a declaration goes from the point where it appears in the source to the end of the file. Constant, type, and variable declaration blocks start with the keywords consts, types, and vars (respectively) followed by declarations. This program is an example:

    program Xample;

    consts:
        C1 = 11;
        Greet = "hi";

    types:
        Tmonth = (Ene, Feb, Mar);
        Tyesno = bool;

    consts:
        Zmonth = Ene;

    vars:
        a: Tmonth;

    procedure main()
    {
        /* ... */ ;
    }

2.2. Constants

Constants are defined like in the example. Constants for basic types have data types derived from their values, which may be expressions as long as their resulting value can be computed at compile time.

Integer literals are digits, base 10, one after another. A leading plus or minus sign is actually a unary expression adjusting the sign of the following operand. Float (real) literals are digits with a decimal point and at least one more digit, perhaps followed by an exponent (i.e., an E, an optional sign, and one or more digits). Boolean values are named True and False. Character literals are a single rune within single quotes. Array of character (string) literals are one or more runes within double quotes. These are some examples:

    consts:
        C1 = 11;        /* int */
        C2 = -2;        /* int */
        C3 = 3.0;       /* float */
        C4 = 4.3E10;    /* float */
        Ok = True;      /* bool */
        X = 'X';        /* char */
        Msg = "hi";     /* array[0..1] of char */

Aggregates are discussed later, along with arrays and records.

2.3. Basic data types

Picky is strongly typed. Too strongly, hence its name. Basic types are bool, char, int, float, and file. They correspond to booleans, characters, integers, floating point real numbers, and external (text) files.

Two types are compatible (for assignment and other operators) only if they have the same name. Predefined types also obey this rule. Constants and literals are an exception: they belong to universal types that are assumed to be compatible with any basic data type of the same kind. This is reasonable, for example, to permit using integer literals in expressions that belong to a user defined integer type. Subranges are another exception. Subranges do not introduce a new type; they declare a restriction defining a subset of an existing type.

A type definition defines a new type and declares its name. For example,

    types:
        Apples = int;
        Oranges = int;

defines two new types: Apples and Oranges. It is not legal to mix apples with oranges, and it is not legal to mix any of them with int values. However, integer constants and literals may be mixed with any of them.

2.4. Predefined variables and constants

There are several constant character values defined: Eof (representing the end of file), Eol (representing the end of line), Tab (tabulator), Esc (escape), and Nul (null byte).

Constants Maxint and Minint report the maximum and minimum values for the int data type, as Maxchar and Minchar do for the char data type.

Predefined variables named stdin and stdout, of type file, exist for standard input and output.

The special value nil is predefined and represents a null pointer. It is type compatible with any pointer type.

2.5. Operators and builtin operations

We describe here the operators available in the language (except for the len operator, which is discussed along with structured data types). For binary operators, both operands must be type compatible. The result has the same type as the operands, with obvious exceptions (e.g., relational operators always yield bool values).

Values of data types other than file may be compared using equality operators:

    Operator    Meaning
    --------    ------------
    ==          Equal to
    !=          Not equal to


Equality yields True if and only if values are equal. Inequality yields True if and only if values are not equal. For structured types (described later), these operators compare their inner elements, one by one.

Values of ordinal data types (that is, bool, char, int, and user defined enumerations) have fixed positions in their abstract sets, and may be compared using the following:

    Operator    Meaning
    --------    ------------------------
    <           Less than
    >           Greater than
    <=          Less than or equal to
    >=          Greater than or equal to

Ordinal values have two more functions defined:

    Built-in    Meaning
    --------    ----------------
    pred(v)     Predecessor of v
    succ(v)     Successor of v

Pred yields the predecessor of v in the data type. Succ yields the successor of v in the data type.

Boolean values accept the usual boolean operators:

    Operator    Meaning
    --------    ----------------------
    and         binary logical and
    or          binary logical or
    not         unary logical negation

And and or evaluate both operands. That is, there is no short-circuit evaluation as found in C.

Numeric data types accept the following operators. Their operands must be type compatible, as usual. Not all operators are defined for both integers and floating point numbers (the table shows legal operand types).

    Operator    Argument types    Meaning
    --------    --------------    ---------------------------------------
    +           float int         binary addition or unary nop
    -           float int         binary subtraction or unary sign change
    *           float int         binary multiplication
    /           float int         binary division
    %           int               binary modulus
    **          float int         binary exponentiation

Expressions may be parenthesized as required. The precedence of operators is indicated by the following table, from low to high precedence. Operators in the same row have the same precedence. All operators associate to the left. Expressions are evaluated left to right.

    Precedence    Operators
    ----------    ----------------
    low           or and
                  == != < > <= >=
                  + - (binary)
                  * / %
                  **
                  + - (unary)
    high          len not

The len operator returns the number of elements in the object given as an argument. It is discussed later, in the section for structured types.

The following functions are defined for float arguments, and yield a float result. They inherit their names and behavior from C, so we do not describe them any further.

    Function       Meaning
    -----------    -----------------
    acos(r)        arc-cosine
    asin(r)        arc-sine
    atan(r)        arc-tangent
    cos(r)         cosine
    exp(r)         exponential
    log(r)         logarithm
    log10(r)       base 10 logarithm
    pow(r1, r2)    power
    sin(r)         sine
    sqrt(r)        square root
    tan(r)         tangent

The following functions are defined to perform I/O. Some of them operate on stdin or stdout, others operate on the file given, as indicated. The argument obj may be a value or l-value of any basic type (i.e., non structured type), and it may also be an array of char.

    Built-in                  Proc/Func    Meaning
    ----------------------    ---------    ----------------------------------------------
    close(file)               procedure    Close the file
    eof()                     function     Report if Eof has been met in stdin
    eol()                     function     Report if Eol has been met in stdin
    feof(file)                function     Report if Eof has been met in file
    feol(file)                function     Report if Eol has been met in file
    fflush(file)              procedure    Flush the output buffer for file
    flush()                   procedure    Flush the output buffer for stdout
    fpeek(file, char)         procedure    Look ahead next char from file, or Eof, or Eol
    fread(file, obj)          procedure    Read object from text representation in file
    freadln(file, obj)        procedure    Idem, and skip the rest of line (and Eol)
    freadeol(file)            procedure    Read end of line from file
    frewind(file)             procedure    Seek to start of file
    fwrite(file, obj)         procedure    Write text representation for object in file
    fwriteln(file, obj)       procedure    fwrite(file, obj); fwriteeol(file);
    fwriteeol(file)           procedure    Write end of line in file
    open(file, name, mode)    procedure    Open file with given name for mode (which
                                           may be "r", "w", or "rw")
    peek(char)                procedure    Look ahead next char from stdin, or Eof, or Eol
    read(obj)                 procedure    Read object from text representation in stdin
    readln(obj)               procedure    Idem, and skip the rest of line (and Eol)
    readeol()                 procedure    Read end of line from stdin
    write(obj)                procedure    Write text representation for object in stdout
    writeln(obj)              procedure    write(obj); writeeol();
    writeeol()                procedure    Write end of line in stdout

L-values of pointer types may use the following builtins to allocate and deallocate memory.

    Built-in        Proc/Func    Meaning
    ------------    ---------    ------------------------------------------
    dispose(ptr)    procedure    Dispose memory referenced by ptr
    new(ptr)        procedure    Set ptr to point to newly allocated memory

Three other built-ins are provided for debugging and abnormal termination.

    Built-in       Proc/Func    Meaning
    -----------    ---------    ------------------------------
    fatal(text)    procedure    Print text and abort execution
    stack()        procedure    Dump the stack for debugging
    data()         procedure    Dump global data for debugging

2.6. Type casts

In general, the language does not permit type casts. However, type casts are permitted to convert ordinals to the integer representing their position in the type and vice-versa. Also, integers may be converted to floating point numbers and vice-versa. To convert a value to a type, use the target type name as a function. For example, these are legal expressions:

    char(int('A') + 1)
    float(3)
    int(4.2)
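For readers who know C, the three cast expressions above have close C counterparts. The sketch below is only an analogy: the helper names are hypothetical, and the document does not say whether Picky truncates or rounds when converting a float to an int, so the last helper shows C's behavior (truncation), not a confirmed Picky rule.

```c
#include <assert.h>

/* Hypothetical helpers mirroring the three Picky expressions above. */

/* char(int('A') + 1): take the position of c, add one, convert back. */
static char nextchar(char c)
{
	return (char)((int)c + 1);
}

/* float(3): widen an integer to a floating point number. */
static double inttofloat(int i)
{
	return (double)i;
}

/*
 * int(4.2): C discards the fractional part.
 * Picky's rounding rule is not stated in this document.
 */
static int floattoint(double r)
{
	return (int)r;
}
```

With these helpers, nextchar('A') corresponds to the first expression, inttofloat(3) to the second, and floattoint(4.2) to the third.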

2.7. Basic type definitions

A new type may be defined as a new instance of an existing type by using the existing type as its definition. For example,

    types:
        Apples = int;
        Oranges = int;

Enumerated types are also ordinal types, and are defined by enumerating their literals, as in this example:

    types:
        Month = (Jan, Feb, Mar);
        Yesno = (No, Yes);

The definition of Month introduces both the Month data type and the new literals Jan, Feb, and Mar.

Subranges of existing ordinal data types (i.e., bool, char, int, and enumerated data types) may be declared. Subranges do not introduce a new data type. They introduce a range limit for an existing type, and remain type compatible with that type. Ranges are checked at run time, and a violation by user code may lead to a program panic. A subrange is defined by naming the actual type and the range, as in this example:

    types:
        Mrange = Month Jan..Feb;
        Letter = char 'a'..'z';
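The run-time range check described above can be pictured in C as a guarded assignment. This is a minimal sketch under stated assumptions: the Range type and the inrange and setchecked helpers are hypothetical names, and calling abort stands in for whatever the PAM interpreter actually does on a panic.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical descriptor for a subrange such as Letter = char 'a'..'z'. */
typedef struct Range {
	int lo;
	int hi;
} Range;

/* Returns 1 when v lies within r (bounds included), 0 otherwise. */
static int inrange(Range r, int v)
{
	return v >= r.lo && v <= r.hi;
}

/* Checked assignment: store v in *dst, or abort as a run-time panic would. */
static void setchecked(int *dst, Range r, int v)
{
	if(!inrange(r, v)){
		fprintf(stderr, "value %d out of range %d..%d\n", v, r.lo, r.hi);
		abort();
	}
	*dst = v;
}
```

The point of the design is that the subrange stays type compatible with its base type, so the only thing that distinguishes it is this check at each assignment.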

2.8. Structured Types

Array types may be declared using an ordinal type (usually a subrange) as the index specifier and any other type as the element specifier. For example:

    types:
        Days = array[Month] of int;
        Days2 = array[Jan..Feb] of int;

There is no data type for strings. Instead, an array of characters indexed by integers starting with 0 is used.

The syntax does not allow nesting definitions for data types. Only the range in the index specifier may be written in place, instead of defining a type name and then using it. This enforces the policy of declaring type names for the inner components of structured data. As a result, multi-dimensional arrays require defining the type for a row or column (in n-1 dimensions) and then the type for the array, using the previous one as the element type.

Syntax to refer to array elements is as expected in C-like languages:

    days[Jan]
    matrix[3][2]

Record (or structure, or tuple) types may be declared using the record keyword and a bracketed list of field declarations, as in this example:

    program Example;

    types:
        Prange = int 1..10;
        Point = record
        {
            x: int;
            y: int;
        };
        Points = array[Prange] of Point;
        Poly = record
        {
            points: Points;
            npoints: int;
        };

It is feasible to switch on the value of an enumerated-type field to define some fields only for particular values of that switch field. For example:

    Cmd = record
    {
        code: Code;
        kind: Kind;
        switch(kind){
        case Rangecmd:
            r: Rangetype;
        case Recmd, Strcmd:
            s: Str;
        case Intcmd:
            i: int;
        }
    };

In this case, the field s is available only when the field kind has either Recmd or Strcmd as its value. For values of kind other than Rangecmd, Recmd, Strcmd, and Intcmd, the only fields of Cmd are code and kind.

As explained before, type definitions may not be nested. For example, it is imperative to define the types Point and Points in the earlier example before defining Poly. Otherwise, members of Poly could not be arrays or records. Only Prange might be avoided, by using the range directly in the definition of Points.

Syntax for member access is as expected, using the dot notation. For example:

    poly.points[1].x
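The Cmd record with a switch on kind resembles a tagged union in C. The sketch below is an analogy, not the way Picky implements it: Code, Rangetype, and Str are not defined in the text, so the typedefs here are stand-ins, Othercmd is a hypothetical extra value of Kind with no variant fields, and cmdstr is a hypothetical accessor that enforces by hand what the Picky rule states.

```c
#include <assert.h>
#include <stddef.h>

/* Kinds named in the example; Othercmd is a hypothetical extra value. */
typedef enum Kind { Rangecmd, Recmd, Strcmd, Intcmd, Othercmd } Kind;

/* Stand-ins for types that the text uses but does not define. */
typedef int Code;
typedef struct Rangetype { int lo, hi; } Rangetype;
typedef struct Str { char s[16]; } Str;

typedef struct Cmd {
	Code code;
	Kind kind;
	union {
		Rangetype r;	/* valid when kind == Rangecmd */
		Str s;		/* valid when kind == Recmd or Strcmd */
		int i;		/* valid when kind == Intcmd */
	} u;
} Cmd;

/* Returns a pointer to the s field, or NULL when kind does not select it. */
static Str *cmdstr(Cmd *c)
{
	if(c->kind == Recmd || c->kind == Strcmd)
		return &c->u.s;
	return NULL;
}
```

In C nothing stops a program from reading u.s when kind is Intcmd, which is why the accessor is needed.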

The operator len may be used with a type, variable, or constant name to yield the number of members of the given object or type. For example,

    len Points

would be the integer value 10 in the previous example. This operator is always evaluated at compile time and does not evaluate its argument.

2.9. Aggregates

For arrays and records, literal values may be constructed using the type name as a (constructor) function and supplying as arguments values of appropriate types for each of the members, in the order used in the type definition. An aggregate value may be used in any place where a value of the corresponding type may be used, including constant definitions and subprogram arguments. For example:

    types:
        Arry = array[0..1] of char;
        Word = record{
            chars: Arry;
            n: int;
        };

    consts:
        Greet = Word("hi", 2);

2.10. Pointers

A pointer data type refers to another type and permits using new and dispose to handle dynamic variables of the pointed-to type. Type definition uses the ^ notation, taken from Pascal:

    types:
        Arry = array[1..10] of int;
        Iptr = ^int;
        Aptr = ^Arry;

The first definition declares the array data type Arry, which the third one uses to declare Aptr, a pointer to Arry. The second definition declares Iptr, a pointer to integer.

It is legal to declare a pointer to a type that is not yet defined in the program, but the target type must be defined later. This permits declaring circular data types, like linked lists. In no other case may a type be defined in terms of types not yet defined.

Syntax to dereference a pointer value is taken from Pascal, and also uses the ^ sign:

    iptr^ = 2;
    aptr^[1] = iptr^;
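The forward-reference rule for pointers is what makes a linked list declarable. The C sketch below shows the same shape: the pointer type List is declared before the record it points to. All names here (List, Node, push, freelist, sum, nlive) are hypothetical, and the nlive counter is only a stand-in for how a run time could notice memory that was never disposed of.

```c
#include <assert.h>
#include <stdlib.h>

/* The pointer type is declared before the record it points to. */
typedef struct Node *List;

struct Node {
	int val;
	List next;
};

/* Live allocations; a hypothetical stand-in for run-time leak tracking. */
static int nlive = 0;

/* Like new(ptr): allocate a node and link it in front of l. */
static List push(List l, int val)
{
	List n = malloc(sizeof *n);

	if(n == NULL)
		abort();
	nlive++;
	n->val = val;
	n->next = l;
	return n;
}

/* Like dispose(ptr) on every node; returns the empty list. */
static List freelist(List l)
{
	while(l != NULL){
		List next = l->next;
		free(l);
		nlive--;
		l = next;
	}
	return NULL;
}

/* Adds up the values stored in the list. */
static int sum(List l)
{
	int tot = 0;

	for(; l != NULL; l = l->next)
		tot += l->val;
	return tot;
}
```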

All memory allocated with new must be released by calling dispose before completion of the program, or the program will abort and report memory leaks.

2.11. Procedures and functions

Procedures are actions with names and do not return values. Arguments are passed by value by default. Multiple arguments are declared separated by commas. Using the keyword ref before an argument name makes that parameter be passed by reference. For example,

    procedure initword(ref w: Tword)
    {
        w = nil;
    }

defines a procedure with a single argument, passed by reference, of type Tword. Instead,

    procedure addtoword(ref w: Tword, c: char)
    {
        ...
    }

defines a procedure with two arguments. w is of type Tword and is passed by reference, whereas c is of type char and is passed by value.

Functions are declared in a similar way, using the function keyword and declaring the return type, like in this example:

    function isblank(c: char): bool
    {
        return c == ' ' or c == Tab or c == Eol;
    }
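The two parameter-passing modes described for procedures map onto C as follows. This is a minimal sketch with hypothetical names: incval receives a copy, incref receives an address (which is what ref arranges implicitly in Picky), and isblankchar is a C rendering of isblank that assumes Tab and Eol correspond to '\t' and '\n'.

```c
#include <assert.h>

/* By value: the callee changes its own copy; the caller's variable is untouched. */
static void incval(int n)
{
	n = n + 1;
	(void)n;
}

/* By reference: C passes the address explicitly and the callee writes through it. */
static void incref(int *n)
{
	*n = *n + 1;
}

/*
 * A function with value parameters only: its result depends on nothing
 * but its argument. Assumes Tab is '\t' and Eol is '\n'.
 */
static int isblankchar(char c)
{
	return c == ' ' || c == '\t' || c == '\n';
}
```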

All function arguments must be passed by value. After all, we teach that functions should have no side effects and should preserve referential transparency.

2.12. Global and local variables

Global variables are declared like types and constants, with a declaration block. In this case, the keyword vars must be used instead. For example:

    program Xample;

    vars:
        n: int;

    procedure main()
    {
        ...
    }

The declaration uses the Pascal colon syntax. Unlike in Pascal, it is not allowed to declare a type on the fly in the variable declaration. A type identifier is required after the colon. Also, there is no initialization syntax, by design. Variable initialization must happen in the body of procedures and functions. All variables are initialized to random values. That means that it is unlikely to find them zeroed, even the first time they are used.

Local variables are declared between the procedure or function header and its body. In this case, the vars declaration specifier is not used. Procedures and functions may not contain constant or type definitions and so, declarations always refer to (local) variables. This example declares a local variable named f:

    function fact(n: int): int
        f: int;
    {
        ...
        return f;
    }

2.13. Statements

Statements are not expressions (as they are in C), but actions (like in Pascal). They must be terminated by a semicolon. The null statement is just the semicolon on its own. Statement blocks are enclosed in curly brackets, as has been seen for procedure and function bodies, which are blocks.

Assignment uses the = operator, like in C. For example:

    x = 0;

Needless to say, both sides must be type compatible and the left side must be an l-value.

Function calls are not allowed as statements, because they are expressions. Procedure calls are allowed as statements (and not in expressions), and use the obvious syntax:

    write(3);
    writeln();
    fwrite(stdout, Eol);

If there are no arguments, parentheses must still be supplied.

The statement return returns a value from a function, like in the example of the previous section. It is required that return is the last statement in the function body. Early returns are not allowed. It is permitted to use a conditional as the last statement in a function, as long as all its arms include a return statement as their last statement. Procedures may not use return.

2.14. Control structures

Conditional execution is controlled by the if statement, which borrows syntax from C. But there are differences. Statements used for the then and else arms must be blocks. That is, brackets must always be used. For example:

    if(len(w) > len(max)){
        max = w;
    }

Multiple if statements may be chained by using an if statement directly in the else of a previous if:

    if(c == ' ' or c == Tab){
        read(c);
    }else if(c == Eol){
        readeol();
    }

while and do-while loops borrow their syntax from C:

    do{
        read(c);
    }while(not eof() and isblank(c));

and

    while(w != nil){
        tot = tot + w^.len;
        w = w^.next;
    }

The for loop resembles that of C, but has semantics closer to Pascal. Two expressions, an initialization and a condition, are present within parentheses in the loop header. The initialization must be an assignment to a variable of an ordinal type. The condition must use one of the <, <=, >, or >= operators. The first two make the variable increase automatically after each iteration. The last two make the variable decrease automatically after each iteration. For example:

    for(i = 0, i < Nitems){
        write(item[i]);
    }

After the for loop, the control variable is equal to the value on the right of the condition. This implies that there is no out of range condition for the control variable, even when using <= or >= with the last or first valid value of an ordinal type. In our example, the value of i would be Nitems when the loop is done.

Multi-way conditionals use a switch syntax that resembles (but differs from) that of C. Unlike in C, there is no fall-through, and there is no break statement. Expressions used in each case may be single values (of an ordinal type), multiple values separated by commas (matching any of them), or a range using the dot-dot notation. For example:

    switch(4){
    case 3, 4..8:
        c = True;
    case 1..4:
        c = True;
    case 5:
        c = True;
    default:
        ;
    }
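The rule that a for loop's control variable ends equal to the right-hand side of its condition can be sketched in C for the inclusive, increasing form. This is an illustration of the stated rule, not the code the compiler generates: sumrange is a hypothetical name, and leaving the variable at its initial value when the range is empty is an assumption, since the text does not cover that case.

```c
#include <assert.h>
#include <limits.h>

/*
 * Sketch of: for(i = lo, i <= hi){ sum = sum + i; }
 * The final value of i is reported through *last.
 * Assumption: when lo > hi the body never runs and i keeps the value lo.
 */
static long long sumrange(int lo, int hi, int *last)
{
	long long sum = 0;
	int i = lo;

	if(i <= hi){
		for(;;){
			sum += i;	/* loop body */
			if(i == hi)
				break;	/* stop before stepping past hi */
			i++;
		}
	}
	*last = i;
	return sum;
}
```

Testing for the bound before incrementing is what keeps i in range. A plain C loop, for(i = lo; i <= hi; i++), would try to step past INT_MAX when hi is the largest int.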

3. The compiler

The Picky compiler, pick, is implemented in C for Plan 9 as of today. Ports to Linux, Windows, and MacOS X are available. The description of the compiler provided in this section corresponds to an early version of the implementation. It is meant to provide a hint to people who must modify the compiler, but it is not up to date with respect to the implementation. The language description of previous sections is, of course, up to date.

The compiler is implemented using yacc, and should be easy to understand. There are several things to know before attempting to modify it, which are documented here.

The compiler leaks memory. Programs are expected to be small, and we prefer compilation to be fast and the compiler to be robust. Therefore, data structures are seldom deallocated. Allocators for data structures request Aincr items at once when exhausted, and they never release memory.

Symbol table handling as implemented is fast enough, but it is both simple and clumsy, and is the first thing that should be improved if more work is put into the compiler.

There are no warnings. All diagnostics correspond to compile time errors. In many cases, when an error is detected, a symbol or node in the syntax tree is still built, for safety; other parts of the compiler still get a data structure as expected, and it is less likely that an invalid value causes a bug.

3.1. Symbol table

The symbol table is implemented as a stack of environments:

    /*
     * One per program, procedure, and function.
     * Used to keep symbols found in it and also to collect
     * definitions for arguments, constants, types, variables, and statements.
     */
    struct Env
    {
        ulong id;
        Sym* tab[Nhash];    /* symbol table */
        Env* prev;          /* in stack */
        Sym* prog;          /* ongoing program, procedure, or function */
        Type* rec;          /* ongoing record definition */
    };
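A stack of environments, each holding its own hash table, can be sketched as below. This is not the compiler's code. Only the tab and prev fields follow the Env structure above; the value of Nhash, the hash function, and the helpers pushenv, popenv, defsym, and lookup are hypothetical. Whether the real lookup walks the stack from the innermost scope outward is also an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { Nhash = 101 };	/* hypothetical size; the real value is not given */

typedef struct Sym Sym;
typedef struct Env Env;

struct Sym {
	const char *name;
	Sym *hnext;	/* next symbol in the same hash bucket */
};

struct Env {
	Sym *tab[Nhash];	/* symbol table */
	Env *prev;		/* enclosing environment in the stack */
};

static Env *env;	/* top of the stack */

static unsigned hash(const char *s)
{
	unsigned h = 0;

	while(*s != '\0')
		h = h*31 + (unsigned char)*s++;
	return h % Nhash;
}

/* Enter a new scope. */
static void pushenv(void)
{
	Env *e = calloc(1, sizeof *e);

	if(e == NULL)
		abort();
	e->prev = env;
	env = e;
}

/* Leave the current scope; nothing is freed, as in a compiler that leaks. */
static void popenv(void)
{
	env = env->prev;
}

/* Define name in the current scope. */
static Sym *defsym(const char *name)
{
	Sym *s = calloc(1, sizeof *s);
	unsigned h = hash(name);

	if(s == NULL)
		abort();
	s->name = name;
	s->hnext = env->tab[h];
	env->tab[h] = s;
	return s;
}

/* Search from the innermost scope outward; NULL when not found. */
static Sym *lookup(const char *name)
{
	Env *e;
	Sym *s;

	for(e = env; e != NULL; e = e->prev)
		for(s = e->tab[hash(name)]; s != NULL; s = s->hnext)
			if(strcmp(s->name, name) == 0)
				return s;
	return NULL;
}
```

With this layout an inner definition shadows an outer one simply because its environment is visited first, and popping the environment makes the outer definition visible again.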


The global env points to the top of the stack. There is an initial environment used for the top level (the outer scope). Another environment is pushed for each procedure, function, argument list, and record field list that is found.

In some cases, the attributes in the grammar are not used to populate a node in the syntax tree. Instead, the global env is accessed to locate the procedure, function, or program being defined. The same is done to define fields for records. In most other cases, attributes as handled by yacc suffice.

Each environment is a hash table that keeps symbols for the compiler. Two additional hash tables are kept: one to store strings and another to store keywords.

    static Sym *strs[Nbighash];    /* strings and names */
    static Sym *keys[Nhash];       /* keywords and top-level */

The former is used to keep an entry for each name found in the source. For simplicity, it maintains Syms and not strings. The latter is used to keep keywords and global definitions. The scanner (written by hand) looks up these tables to learn if a token for a keyword should be given to the parser. In most other cases, it allocates a new entry in the strings table and returns its symbol.

The grammar uses different tokens for identifiers and type identifiers. Therefore, the scanner checks if an (already defined) identifier is for a type or for any other value.

A symbol is represented by this data structure. For simplicity, the same data structure is also used for nodes in the syntax tree for expressions, although strictly speaking they are not symbols.

    /*
     * Symbol table entry.
     */
    struct Sym {
        ulong id;
        char* name;
        Sym* hnext;
        int stype;
        int op;
        char* fname;
        int lineno;
        Type* type;
        union{
            int tok;
            long ival;
            double rval;
            char* sval;
            struct{
                int used;
                int set;
            };
            struct{         /* binary, unary */
                Sym* left;
                Sym* right;
            };
            struct{         /* Sfcall */
                Sym* fsym;
                List* fargs;
            };
            struct{         /* "." */
                Sym* rec;
                Sym* field;
            };
            Prog* prog;
        };

        /* backend */
        union{
            ulong addr;
            ulong off;      /* fields */
        };
    };

The union(s) correspond to attributes for the symbol and backend information. In general, a symbol has a name, belongs to a type of symbol (stype) and depending on the type may correspond to one operation or another (op). These are the types of symbols known:

    /* symbol types and subtypes */
    Snone = 0,
    Skey,       /* keyword */
    Sstr,       /* a string buffer */
    Sconst,     /* constant or literal */
    Stype,      /* type def */
    Svar,       /* obj def */
    Sunary,     /* unary expression */
    Sbinary,    /* binary expression */
    Sproc,      /* procedure */
    Sfunc,      /* function */
    Sfcall,     /* procedure or function call */

Symbols used to represent expressions carry in op the operation for the node:


    Onone = 0,
    Ole, Oge, Odotdot, Oand, Oor,           /* 5 */
    Oeq, One, Opow, Oint, Onil,             /* 10 */
    Ochar, Oreal, Ostr, Otrue, Ofalse,      /* 15 */
    Onot, Olit, Ocast, Oparm, Orefparm,     /* 20 */
    Olvar, Ouminus,

In some cases, a symbol keeps a list of symbols as children. In all such cases, a List structure is used:

    struct List {
        int nitems;
        int kind;
        union{
            Stmt** stmt;
            Sym** sym;
            void** items;
        };
    };

where kind must be any of

    /* List kinds */
    Lstmt = 0,
    Lsym,

For example, argument lists are lists of kind Lsym, and statement blocks are lists of kind Lstmt. An important symbol type is that for programs (and procedures and functions). It holds a Prog structure as its value, also linked from the corresponding Env structure.


    struct Prog {
        Sym* psym;
        List* parms;
        Type* rtype;    /* ret type or nil if none */
        List* consts;
        List* types;
        List* vars;
        List* procs;
        Stmt* stmt;
        Builtin *b;
        int nrets;

        /* backend */
        Code code;
        ulong parmsz;
        ulong varsz;
    };

The parser adds new symbols to the lists of constants, types, variables, and procedures/functions, as new elements are analyzed in the source. The single stmt is a block for the body of the procedure or function. For built-ins, b keeps a Builtin structure used to decorate the parser node with attributes and to encode the type signature.

    struct Builtin {
        char *name;
        u32int id;
        int kind;
        char *args;
        char r;
        Sym* (*fn)(Builtin *b, List *args);
    };
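
As an illustration of the stack of environments described at the start of this section, the following is a minimal sketch of how a name could be defined in the innermost scope and resolved by walking the stack outwards. It is not the code of the compiler: the hash function, the value of Nhash, and the names pushenv, popenv, defsym, and lookup are assumptions made for the example, and the structures are reduced to the fields needed here.

```c
#include <stdlib.h>
#include <string.h>

enum { Nhash = 64 };	/* assumed size; the real value is not given in this document */

typedef struct Sym Sym;
typedef struct Env Env;

struct Sym {
	const char*	name;	/* caller keeps the string alive */
	Sym*	hnext;		/* next symbol in the same hash bucket */
};

struct Env {
	Sym*	tab[Nhash];	/* symbol table */
	Env*	prev;		/* in stack */
};

static Env *env;	/* top of the environment stack */

/* Hypothetical string hash, reduced to a bucket index. */
static unsigned
hash(const char *s)
{
	unsigned h = 5381;

	while(*s != 0)
		h = h*33 + (unsigned char)*s++;
	return h % Nhash;
}

/* Push a new, empty environment (e.g., on entering a procedure). */
static Env*
pushenv(void)
{
	Env *e = calloc(1, sizeof *e);

	if(e == NULL)
		abort();
	e->prev = env;
	env = e;
	return e;
}

/* Pop the innermost environment; its memory is deliberately not released. */
static void
popenv(void)
{
	if(env != NULL)
		env = env->prev;
}

/* Define name in the innermost environment. */
static Sym*
defsym(const char *name)
{
	unsigned h = hash(name);
	Sym *s = calloc(1, sizeof *s);

	if(s == NULL || env == NULL)
		abort();
	s->name = name;
	s->hnext = env->tab[h];
	env->tab[h] = s;
	return s;
}

/* Look name up from the innermost environment outwards. */
static Sym*
lookup(const char *name)
{
	unsigned h = hash(name);
	Env *e;
	Sym *s;

	for(e = env; e != NULL; e = e->prev)
		for(s = e->tab[h]; s != NULL; s = s->hnext)
			if(strcmp(s->name, name) == 0)
				return s;
	return NULL;
}
```

With this arrangement, a definition in an inner environment hides one with the same name in an outer environment until the inner environment is popped.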

3.2. Data types

Each symbol is expected to have a type attached. The type is described by this data structure:


    /*
     * Types
     */
    struct Type {
        int op;
        Sym* sym;
        int first;
        int last;
        union{
            List* lits;         /* Tenum */
            Type* ref;          /* Tptr */
            Type* super;        /* Trange */
            struct{             /* Tarry, Tstr */
                Type* idx;
                Type* elem;
            };
            List* fields;       /* Trec */
            struct{             /* Tproc, Tfunc */
                List* parms;
                Type* rtype;
            };
        };

        /* backend */
        ulong id;
        ulong sz;
    };

Type constructors allocate new structures. Two types are compatible if their addresses in memory are the same. Exceptions are made to support universally compatible data types, as used for constants. The op field in Type identifies the kind of type. It is any of:

    /* Type kinds */
    Tundef = 0,
    Tint, Tbool, Tchar, Treal, Tenum,       /* 5 */
    Trange, Tarry, Trec, Tptr, Tfile,       /* 10 */
    Tproc, Tfunc, Tprog, Tfwd,
    Tstr,       /* 15; fake: array[int] of char; but universal */

Type Tfwd is used to temporarily define a type as a forward declaration. This is used for pointers, which permit the target type to be defined later. Type Tstr is an artifact, to represent strings, which are type-compatible with arrays of characters of the same length.

All ordinal types have their first and last values stored in their Type structure. This makes it possible to perform range checks without paying attention to the difference between types and subtypes (only subranges as of today).
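
The two checks described in this section can be sketched as follows. This is only an illustration under stated assumptions: the universal flag and the rule that a universal type matches any type of the same kind are hypothetical (the document only says that exceptions are made for constants), the names tcompat and inrange are made up for the example, and the type kinds are renumbered.

```c
#include <stddef.h>

/* A few type kinds, renumbered for this sketch. */
enum { Tundef, Tint, Tchar, Trange };

typedef struct Type Type;
struct Type {
	int	op;		/* kind of type */
	int	first;		/* first legal value, for ordinal types */
	int	last;		/* last legal value, for ordinal types */
	int	universal;	/* hypothetical: set for the types of constants */
};

/*
 * Two types are compatible if they are the very same structure.
 * Assumption: a universal type is also compatible with any type
 * of the same kind.
 */
static int
tcompat(const Type *t1, const Type *t2)
{
	if(t1 == t2)
		return 1;
	return t1->op == t2->op && (t1->universal || t2->universal);
}

/* Range check using only first and last, whatever the kind of ordinal type. */
static int
inrange(const Type *t, long v)
{
	return v >= t->first && v <= t->last;
}
```

Note that two structures with identical contents are still different types here, because each type constructor allocates a new structure.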


3.3. Statements

Statements are described by Stmt structures:

    /*
     * Statements
     */
    struct Stmt {
        int op;
        char* sfname;
        int lineno;
        union{
            List* list;         /* '{' */
            struct{             /* = */
                Sym* lval;
                Sym* rval;
            };
            struct{             /* IF */
                Sym* cond;
                Stmt* thenarm;
                Stmt* elsearm;
            };
            Sym* fcall;         /* FCALL */
            struct{
                Sym* expr;      /* RETURN, DO, WHILE, CASE */
                Stmt* stmt;
            };
        };
    };

The op field identifies the kind of statement. A token representative of the statement is used for this purpose. The union keeps the information describing the statement.

Statements for for loops are rewritten as a block that contains the initialization, a while loop, and its body adjusted to include the increment or decrement for the control variable.

Switch statements are also rewritten, to use a sequence of chained if-then-else statements, each one checking the value of the expression we are switching on. To prevent multiple evaluation of the switch expression, a variable is declared by the compiler for each such statement. The switch is rewritten to initialize the variable with the value of the expression, and then execute the chained if corresponding to the branches.

3.4. Builtins and predefined identifiers.

Builtin procedures and functions have type signatures generated from a description string within the front-end. Arguments are checked by a generic builtin type check function, which takes into account the polymorphic nature of procedures like write.

Builtin functions check to see if their arguments are evaluated as a result of constructing their nodes in the front-end. In that case, if the builtin may yield a value at compile time, the function call is replaced by the resulting value. The implementation tries to check that the arguments are legal (e.g., that they would not cause a floating point exception) and issues a sensible diagnostic otherwise. This process is guided by a Builtin structure as shown before.

Calls to file procedures and functions that operate on stdin and stdout are rewritten to pass the file explicitly, using the variants of the builtins that accept a file argument.

Pre-defined constants and variables are added to the environment for the top-level scope as soon as the parser tries to declare a program. Afterwards, they are handled like user defined objects.
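
The replacement of a builtin call by its value at compile time can be pictured with the following sketch for succ. It is a simplified illustration, not the front-end code: the Node structure, the function foldsucc, its return convention, and the rule used to decide that an argument is illegal (the successor of the last value of the type) are all assumptions made for the example.

```c
#include <stddef.h>

/* Two symbol kinds, renumbered for this sketch. */
enum { Sconst, Sfcall };

typedef struct Node Node;
struct Node {
	int	stype;	/* Sconst or Sfcall */
	long	ival;	/* value, for Sconst */
	long	last;	/* last legal value of the node's type */
	Node*	arg;	/* single argument, for Sfcall */
};

/*
 * Hypothetical folding step for a call to succ.
 * Returns 0 and leaves the node alone if the argument is not constant,
 * 1 if the call was replaced by its value, and
 * -1 if the argument is constant but illegal (a diagnostic is due).
 */
static int
foldsucc(Node *n)
{
	if(n->stype != Sfcall || n->arg == NULL || n->arg->stype != Sconst)
		return 0;
	if(n->arg->ival >= n->arg->last)
		return -1;
	n->ival = n->arg->ival + 1;
	n->last = n->arg->last;
	n->stype = Sconst;	/* the call node becomes a constant node */
	n->arg = NULL;
	return 1;
}
```

A call whose argument is not known at compile time is left untouched and is evaluated by the interpreter at run time.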


3.5. Code generation

Code generation is straightforward, and uses back-patching to set label addresses. Procedures are called by procedure number, and not by procedure address. Therefore, this mechanism is not applied in this case. Code is generated in blocks (one per procedure), using this structure:

    /* generated code */
    struct Code {
        u32int addr;
        Pcent* pcs;
        Pcent* pcstl;
        u32int* p;
        ulong np;
        ulong ap;
    };

Here, p is the pointer to byte-codes (actually using a full u32int each); np is the number of byte-codes (words) produced, and ap is the number of byte-code slots (words) available in p.

For each statement, and for symbol and expression nodes, entries to match program counter to source file and line are linked into the code structure.

    /* pc/src table */
    struct Pcent {
        Pcent* next;
        Stmt* st;
        Sym* nd;
        ulong pc;
    };

Either st or nd is used, not both at the same time.

4. The interpreter

The description of the interpreter provided in this section corresponds to an early version of the implementation. It is meant as a guide for people who must modify the interpreter, but it is not up to date with respect to the implementation. The language description of earlier sections is, of course, up to date.

The interpreter, pam, implements an abstract machine known as PAM. The machine is a stack based machine. Most operations take arguments from the stack and replace them with a result, also pushed on the stack. There is a single flow of control, guided by an (almost) endless loop switching on the instruction type.

The interpreter leaks memory for storage allocated with new, to detect when disposed data structures are used and issue more descriptive diagnostics than a segmentation violation. Also, it checks that assigned values are in range, more often than needed, to try to detect constraint errors early in the execution.

All memory (data, stack variables, and dynamic memory) is initialized with random values, to let the user discover early that variable initialization is missing. Such random values are always odd, to recognize pointer values that were not initialized and issue a descriptive diagnostic for that case at run time, instead of a segmentation violation or a heisenbug.
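
The trick of using odd random values can be sketched as follows. This is an illustration and not the code of pam: the function names scramble and uninit are made up, rand is used only as a stand-in source of random values, and the sketch assumes that every legal pointer value is even.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Fill n bytes at p with random 32-bit words that are always odd.
 * n is assumed to be a multiple of the word size; any trailing
 * bytes are left alone.
 */
static void
scramble(void *p, size_t n)
{
	unsigned char *b = p;
	uint32_t w;
	size_t i;

	for(i = 0; i + sizeof w <= n; i += sizeof w){
		w = (uint32_t)rand() | 1u;	/* force the low bit */
		memcpy(b + i, &w, sizeof w);
	}
}

/*
 * Assumption: legal pointer values are even, so an odd value can
 * only be the random filler and the pointer was never initialized.
 */
static int
uninit(uint64_t ptrval)
{
	return (ptrval & 1) != 0;
}
```

Because every 32-bit word of the filler is odd, a 64-bit address read from two consecutive words is odd as well, whatever the byte order of the machine.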


4.1. PAM

PAM is the Picky Abstract Machine. It has the following elements:

- Some registers:

    pc    Program counter. Addressing words, each one a byte-code.
    fp    Frame pointer. Addressing bytes. To locate the activation frame for the current procedure.
    sp    Stack pointer. Addressing bytes. To locate the top of the stack.
    vp    (Local) Variable pointer. Used to translate local variable addresses into actual memory addresses.
    ap    Argument pointer. Used to translate local argument addresses into actual memory addresses.
    pid   Procedure identifier. Used to locate the descriptor for the procedure executing (or function).

- Text memory. Word addressed area of memory used to keep byte codes. Each byte code is a word, not a byte. Operations taking an argument use another word for the argument. The pc register indexes this memory, starting at 0.

- Stack memory. Byte addressed area of memory containing global variables (bottom of stack) and activation frames for procedures and functions. Stack addresses are machine addresses (i.e., actual addresses as used by the C implementation of PAM). All of sp, fp, vp, and ap point into this memory (i.e., they are actual C pointers in the implementation).

- Dynamic memory. Dynamic variables are stored using the underlying C heap. However, pointer values are references to descriptors that refer to the actual memory allocated. This is used as a fence to detect run time errors in user pointers, to issue diagnostics that help.

- Procedure descriptors. An array indexed by procedure identifier containing metadata for procedures and functions.

- Type descriptors. An array indexed by type identifier containing descriptions for types, both built-in and user defined types.

- Variable descriptors. An array indexed by variable identifier containing metadata for variables (e.g., their type identifiers).

- Program counter entries. An array mapping program counters to source file names and line numbers.
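
The use of descriptors as a fence for dynamic memory can be sketched as follows. This is a hypothetical illustration, not the code of pam: the Ptr structure, the fixed table size, the use of descriptor indexes as pointer values (with 0 as nil), and the names xnew, xdispose, and xderef are all assumptions.

```c
#include <stdlib.h>

enum { Nptr = 128 };	/* assumed table size */

typedef struct Ptr Ptr;
struct Ptr {
	void*	mem;	/* storage taken from the C heap */
	size_t	sz;
	int	live;	/* cleared by dispose */
};

static Ptr ptrs[Nptr];
static int nptrs;	/* descriptors in use; index 0 is reserved for nil */

/* Allocate storage; return a descriptor index, or -1 if the table is full. */
static int
xnew(size_t sz)
{
	Ptr *p;

	if(nptrs + 1 >= Nptr)
		return -1;
	p = &ptrs[++nptrs];
	p->mem = calloc(1, sz);
	if(p->mem == NULL)
		abort();
	p->sz = sz;
	p->live = 1;
	return nptrs;
}

/*
 * Mark the storage as disposed. The memory and the descriptor are
 * kept (leaked) so that later uses can be diagnosed.
 * Returns -1 for nil, unknown, or already disposed pointers.
 */
static int
xdispose(int id)
{
	if(id <= 0 || id > nptrs || !ptrs[id].live)
		return -1;
	ptrs[id].live = 0;
	return 0;
}

/* Return the storage for id, or NULL when a diagnostic is due. */
static void*
xderef(int id)
{
	if(id <= 0 || id > nptrs || !ptrs[id].live)
		return NULL;
	return ptrs[id].mem;
}
```

Since user pointers never hold C addresses in this scheme, a dangling or nil pointer is caught by the check in xderef rather than by the hardware.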

A procedure descriptor contains this information:

    struct Pent {
        char *name;     /* for procedure/function */
        ulong addr;     /* for its code in text */
        int nargs;      /* # of arguments */
        int nvars;      /* # of variables */
        int retsz;      /* size for return type or 0 */
        int argsz;      /* size for arguments in stack */
        int varsz;      /* size for local vars in stack */
        char *fname;
        int lineno;
        Vent *args;     /* Var descriptors for args */
        Vent *vars;     /* Var descriptors for local vars. */
    };

A type descriptor contains enough to perform range checks, learn how to read values for the type, or write values for the type, learn the size for objects, and handle or dump objects for debugging.

    struct Tent {
        char *name;     /* of the type */
        char fmt;       /* value format character */
        long first;     /* legal value or index */
        long last;      /* idem */
        int nitems;     /* # of values or elements */
        ulong sz;       /* in memory for values */
        uint etid;      /* element type id */
        char **lits;    /* names for literals */
        Vent *fields;   /* only name, tid, and addr defined */
    };

A variable descriptor is used to describe variables, mostly for debugging and stack dumps.

    struct Vent {
        char *name;     /* of variable or constant */
        uint tid;       /* type id */
        ulong addr;     /* in memory (offset for args, l.vars.) */
        char *fname;
        int lineno;
        char *val;      /* initial value as a string, or nil. */
    };

Program counter entries have this information. Some fields are used to report leaks after program completion.

    struct Pc {
        ulong pc;
        char *fname;
        ulong lineno;
        Pc* next;       /* Pc with leaks; for leaks */
        uint n;         /* # of leaks in this Pc; for leaks */
    };
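
As an example of how such entries may be used to report the source line for a run time error, the following sketch searches a table of entries for a given program counter. It is only an illustration: it assumes that the table is sorted by pc and that each entry covers all program counters up to the next entry, neither of which is stated in this document. The function pclookup is hypothetical, and the structure keeps only the fields needed here.

```c
#include <stddef.h>

typedef struct Pc Pc;
struct Pc {
	unsigned long	pc;
	const char*	fname;
	unsigned long	lineno;
};

/*
 * Binary search in a table sorted by pc.
 * Returns the last entry whose pc is not greater than the one given,
 * or NULL if the given pc precedes every entry.
 */
static const Pc*
pclookup(const Pc *tab, size_t n, unsigned long pc)
{
	size_t lo = 0, hi = n;

	while(lo < hi){
		size_t mid = lo + (hi - lo)/2;

		if(tab[mid].pc <= pc)
			lo = mid + 1;	/* entry at mid is a candidate; look right */
		else
			hi = mid;
	}
	return lo == 0 ? NULL : &tab[lo - 1];
}
```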

4.2. Instruction set

An instruction has two fields: an instruction code and an instruction type. The former describes the instruction. The latter describes if it handles integers, floats, or memory addresses (in those cases when the instruction can do several of them). This is the instruction set:

    add     daddr   eqm     idx     lt      mul     not     sto
    addr    data    eqr     ind     ltr     mulr    or      stom
    and     datar   fld     jmp     lvar    ne      pow     sub
    arg     div     ge      jmpf    minus   nea     ptr     subr
    call    divr    ger     jmpt    minusr  nem     push
    cast    eq      gt      le      mod     ner     pushr
    castr   eqa     gtr     ler     modr    nop     ret

PAM instructions are described by this enumeration (explained later).


    /* instruction code (ic) */
    ICnop = 0,      /* nop */
    ICle,           /* le|r -sp -sp +sp */
    ICge,           /* ge|r -sp -sp +sp */
    ICpow,          /* pow -sp -sp +sp */
    IClt,           /* lt|r -sp -sp +sp */
    ICgt,           /* gt|r -sp -sp +sp */
    ICmul,          /* mul|r -sp -sp +sp */
    ICdiv,          /* div|r -sp -sp +sp */
    ICmod,          /* mod|r -sp -sp +sp */
    ICadd,          /* add|r -sp -sp +sp */
    ICsub,          /* sub|r -sp -sp +sp */
    ICminus,        /* minus|r -sp +sp */
    ICnot,          /* not -sp +sp */
    ICor,           /* or -sp -sp +sp */
    ICand,          /* and -sp -sp +sp */
    ICeq,           /* eq|r|a -sp -sp +sp */
    ICne,           /* ne|r|a -sp -sp +sp */
    ICptr,          /* ptr -sp +sp */
                    /* obtain address for ptr in stack */
    ICargs,         /* those after have an argument */
    ICpush=ICargs,  /* push|r n +sp */
                    /* push n in the stack */
    ICindir,        /* indir|a n -sp +sp */
                    /* replace address with referenced bytes */
    ICjmp,          /* jmp addr */
    ICjmpt,         /* jmpt addr */
    ICjmpf,         /* jmpf addr */
    ICidx,          /* idx tid -sp -sp +sp */
                    /* replace address[index] with elem. addr. */
    ICfld,          /* fld n -sp +sp */
                    /* replace obj addr with field (at n) addr. */
    ICdaddr,        /* daddr n +sp */
                    /* push address for data at n */
    ICdata,         /* data n +sp */
                    /* push n bytes of data following instruction */
    ICeqm,          /* eqm n -sp -sp +sp */
                    /* compare data pointed to by addresses */
    ICnem,          /* nem n -sp -sp +sp */
                    /* compare data pointed to by addresses */
    ICcall,         /* call pid */
    ICret,          /* ret pid */
    ICarg,          /* arg n +sp */
                    /* push address for arg object at n */
    IClvar,         /* lvar n +sp */
                    /* push address for lvar object at n */
    ICstom,         /* stom tid -sp -sp */
                    /* cp tid's sz bytes from address to address */
    ICsto,          /* sto tid -sp -sp */
                    /* cp tid's sz bytes to address from stack */
    ICcast,         /* cast|r tid -sp +sp */
                    /* convert int (or real |r) to type tid */


    /* instr. type (it) */
    ITint = 0,
    ITaddr = 0x40,
    ITreal = 0x80,
    ITmask = ITreal|ITaddr,

Instructions listed above ICargs (which is not an instruction) do not have a following argument in the program text. A single word contains the entire instruction. Those below use a following word to contain the argument for the instruction.

Instructions that have a suffix |r in their comment have a variant that knows how to handle reals. For example, the entry for ICpush means that there are two instructions: push and pushr. The former pushes an integer value (the argument) in the stack. The latter pushes a float value in the stack. Instructions with the suffix |a have a variant that handles addresses.

All atomic values in the stack (booleans, characters, integers, and floats) occupy a single word (32 bits). Addresses use 64 bits, to simplify execution in 64 bit environments. That is, addresses may be actual pointers. For example, there are three eq instructions: eq, eqr, and eqa. They compare integers, floats, and addresses (respectively).

Besides the argument in the program text, most instructions operate with stack arguments (and pop them off the stack) and push results back into the stack. This is represented by the +sp (push) and -sp (pop) marks in the description. Each -sp refers to a single argument taken from the stack.

4.3. Builtins

Builtin procedures and functions have addresses that are not procedure ids. Instead, they have the PAMbuiltin bit set and contain a builtin number in the remaining bits:

    /* Builtin addresses */
    PAMbuiltin = 0x80000000,

    /* builtin numbers (must be |PAMbuiltin) */
    PBacos = 0,
    PBasin, PBatan, PBclose, PBcos, PBdispose,              /* 0x5 */
    PBexp, PBfatal, PBfeof, PBfeol, PBfpeek,                /* 0xa */
    PBfread, PBfreadeol, PBfrewind, PBfwrite, PBfwriteln,   /* 0xf */
    PBfwriteeol, PBlog, PBlog10, PBnew, PBopen,             /* 0x14 */
    PBpow, PBpred, PBsin, PBsqrt, PBsucc,                   /* 0x19 */
    PBtan, PBstack, PBdata,
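
The encoding of builtin addresses can be decoded with a couple of bit operations, as in the following sketch. The helper names isbuiltin and builtinnum are hypothetical, and only the first few builtin numbers are repeated here.

```c
#include <stdint.h>

#define PAMbuiltin 0x80000000u	/* bit that marks a builtin address */

/* First builtin numbers, as in the enumeration of this section. */
enum { PBacos = 0, PBasin, PBatan, PBclose, PBcos, PBdispose };

/* Does the argument of a call refer to a builtin and not to a procedure id? */
static int
isbuiltin(uint32_t addr)
{
	return (addr & PAMbuiltin) != 0;
}

/* Builtin number kept in the remaining bits of the address. */
static uint32_t
builtinnum(uint32_t addr)
{
	return addr & ~PAMbuiltin;
}
```

For instance, PAMbuiltin|PBdispose is the address 0x80000005, for which builtinnum yields 5.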

The arguments for each builtin do not always match those supplied by the user. For example, file I/O procedures carry a type id besides the object or value to let PAM know how to read and write the argument (i.e., which is its type descriptor). This is not documented here. See the implementation for the builtins in pilib.c.

4.4. Binary files.

A PAM binary is actually a PAM assembly file and not a binary. It is a text file, both for debugging and for portability and pedagogical purposes. The file must start with

    #!/bin/pi

Lines starting with # are ignored. The second line must report the procedure id for main:

    entry 3

for example. Following this, there are different sections for types, variables (and constants), procedures, text, and PC/source entries. Each section starts with a line that has the keyword types, vars, procs, text, and pcs (respectively) followed by the number of entries in the section. Each entry is a descriptor (see above) or a text instruction (perhaps with an argument in the same line).

Descriptors have the information shown in the structures found before in this document. Instructions have their address, instruction code (mnemonic, actually) and argument if any. The compiler adds comments in the assembly file to match PAM instructions with the source code.

5. Example source

    /*
     * Example program. Write the longest word in the input.
     */
    program Word;

    consts:
        Blocknc = 2;


    types:
        Tblock = array[1..Blocknc] of char;
        Tword = ^Tnode;
        Tnode = record{
            block: Tblock;
            nc: int;
            next: Tword;
        };

    function isblank(c: char): bool
    {
        return c == ' ' or c == Tab or c == Eol;
    }

    procedure skipblanks(ref end: bool)
        c: char;
    {
        do{
            peek(c);
            if(c == ' ' or c == ' '){
                read(c);
            }else if(c == Eol){
                readeol();
            }
        }while(not eof() and isblank(c));
        end = eof();
    }

    procedure initword(ref w: Tword)
    {
        w = nil;
    }

    function wordnc(w: Tword): int
        tot: int;
    {
        tot = 0;
        while(w != nil){
            tot = tot + w^.nc;
            w = w^.next;
        }
        return tot;
    }

    procedure writeword(w: Tword)
        i: int;
    {
        write("'");
        while(w != nil){
            for(i = 1, i 0 and w != nil){ if(n