The Sawzall Language Specification

Objective

This document is a semi-formal specification of the Sawzall language. It contains a formal definition of the language syntax with concise explanations of type and semantic rules. It accompanies the Sawzall Language document, which is a more informal but broader description of the language and the supporting environment.

Organization and conventions

To make it easier to learn the language, Sawzall's syntax was designed to match C syntax closely where sensible; this is the case in particular for statements and expressions. Also, identifiers, literals (character, integer, floating point, string and other constants) largely follow C notation. It is the hope that many of the finer points can be ignored in the beginning since most programmers will be able to extrapolate from previous experience. Thus, to facilitate a quick start, in the following the language is presented in a top-down fashion rather than the more conventional bottom-up approach used for language specifications.

To describe the language syntax in detail, Extended Backus-Naur Form (EBNF) is used: Alternatives are separated by vertical bars |. Parentheses ( and ) are used for grouping. Expressions enclosed in square brackets [ and ] are optional. Curly braces { and } denote repetition (0 or more times) of the enclosed expression. Names of EBNF productions referring to terminal symbols start with a lower-case letter; all other production names start with a capital letter. Literal characters and strings are enclosed in single quotes ' and '. A precise definition of EBNF can be found in the Appendix.

Occasionally, the language syntax is presented in a slightly simplified form to facilitate reading. For instance, this is the case for the definition of string literals where a complete formal definition would become very cumbersome to understand, without adding value to the reader. Annotations in small font may be used to explain some of the fine points in those cases.

Programming model

Sawzall was designed for parallel log processing. To make parallelization possible, some restrictions have been built into the language by design. Instead of specifying how to process an entire set of log entries, a Sawzall program specifies how to process an individual log entry independent of any other log entry. Thus several log entries may be processed in parallel by different executions of the same Sawzall program, possibly on different machines.

Since processing is independent, data about all the processed log entries is aggregated externally to a Sawzall program proper. To that end, Sawzall supports the declaration of output types and corresponding output variables. Output variables are the connection to specific aggregators (such as accumulators, collectors, filters, etc.) of data. A special construct, the emit statement, is used to send data to such an aggregator. The aggregator receives the data emitted to it, and aggregates it in a manner specific to its type. For example, a sum aggregator, referred to via an output variable of sum output type, will add up all the values it receives. A collection aggregator will simply collect all the data, etc. The Sawzall language doesn't specify a particular implementation for aggregators, but the Szl implementation does provide an implementation.

Processing of a large set of log entries may take a considerable amount of time even if they are processed in parallel. Furthermore, it is not unlikely to occasionally encouter corrupted log entries. It is important to be able to handle these cases gracefully, without abortion of the process. To that end, Sawzall supports the notion of defined and undefined values. Expressions and statements may be silently aborted if undefined values are encountered. It is the hope that most programs will be able to continue execution even in the presence of occasional errors caused by corrupted input data.

Overall program structure

A Sawzall program is simply a sequence of declarations and statements.

Program = { Declaration | Statement }.

Example

# Program to find the three most commonly used word in a list.

# Input is text lines containing a word and a count, separated with a comma.

  topwords: table top(3) of word: string weight count: int;
  fields: array of bytes = splitcsvline(input);
  w: string = string(fields[0]);
  c: int = int(string(fields[1]), 10);
  if (c != 0) {
    emit topwords <- w weight c;
  }

Declarations and scope rules

A declaration introduces a name (syntactically an identifier) and associates it with a language entity. In Sawzall there are type and (static and non-static) variable declarations.

Declaration = TypeDecl | StaticVarDecl | VarDecl.
Once a name has been introduced with a declaration, it can be used within the scope of the declaration to refer to the associated entity. The scope of a declaration extends from the point of the declaration of the name to the end of the immediate surrounding block. Blocks - and therefore scopes - may be nested. A name declared in a nested scope shadows equal names declared in outer scopes; thus a name always refers to its innermost declaration. A Sawzall program implicitly defines a block enclosing all declarations and statements of the program.

Type declarations

A type declaration associates a type name with a type specification. Within the scope of the declaration, the type name can then be used interchangeably with the type specification.

TypeDecl = 'type' type_name '=' Type ';'.
type_name = identifier.

Examples

type my_bool = bool;
type Coordinates = { x: float, y: float };
type CityMap = map [city_name: string] of Coordinates;

Variable declarations

A variable declaration introduces a variable. The declaration associates the variable with a variable name and a variable type. A variable holds a (possibly undefined) value of the associated variable type.

The variable type may be specified explicitly, followed by an optional initialization expression. Alternatively, the explicit type specification may be elided in favor of the initialization expression, in which case the type of the variable is the type of the initialization expression. Thus, either the variable type or the initialization expression (but not both) may be missing. A variable may be declared static by prefixing the declaration with the keyword static.


StaticVarDecl = 'static' VarDecl.
VarDecl = var_name ':' [Type] ['=' Expression | Block] ';'.
var_name = identifier.
The initialization expression for static variables may be evaluated only once per program run (implementation specific). In particular, if the value of an initialization expression for a static variable changes during program execution, the program is erroneous. Variables of an output type are always static, and the static keyword may be omitted in that case.

Initialization expressions for static variables must not refer to non-static variables; if variable names appear in the expression, the associated variables must be static. In particular, if a (non-intrinsic) function is called in an initialization expression, that function must be stored in a static variable.

A variable declaration with an initialization expression of the form:

variable_name: variable_type = expression;
is a shortcut for the variable declaration followed by an assignment, possibly with a conversion (variable_type). The conversion is omitted if expression already is of type variable_type:
variable_name: variable_type;
variable_name = variable_type(expression);
If the type of expression is exactly variable_type the declaration may be further shortened to:
variable_name := expression;
Notice that even though there is no explicit type specified here, it nonetheless declares a new variable (the : indicates the variable declaration). Obviously, in this case no conversion is implied.

If the initialization expression is a function, the variable type must be a function type. Therefore, a function declaration looks like this:

function_name: function_type = function_type {
  # function body
};
Since there is no type conversion possible for functions, the function types in the declaration must be identical and this syntax can always be shortened to:
function_name := function_type {
  # function body
};
A slightly different notation was previously supported as a special case for functions:
function_name: function_type {
  # function body
};
but that form has been obsoleted by the appearance of the := style.

The evaluation of the initialization expression may result in an undefined value. In that case, the undefined value may be silently assigned to the variable (implementation-specific, see also Undefined values below).

Examples

n := 10;
counter: int = 0;
static pi := 3.14159265;
hypot := sqrt(x*x + y*y);
static word := load("/home/szluser/current/word");
a: array of float = { 2.4, PI, float("1.234") };
unique_language_values: table unique (10) of {language: string, value: string};
average := function(list: array of float): float;
min := function(x: int, y: int): int { if (x < y) return x; return y; };
static country_codes: array of string = LoadCountries("countries.txt");

Undefined values

The value of an expression may be either defined or undefined. If the value is defined, it is one of the set of values specified by its type. If the value is undefined, it does not have a value.

If an undefined value is used within an expression, the value of the expression becomes undefined. In particular, if an undefined value is used as an argument for a function call, the function is not called and the function result is undefined.

If an undefined value is assigned to a variable, the variable becomes undefined. In particular, if an undefined value is stored as array element, tuple field, or map entry, the entire array, tuple, or map becomes undefined. Thus, variables are either entirely defined or entirely undefined. Initially, all variables are undefined until a defined value is assigned to them. Function parameters are defined upon invocation of the function because it can only be called if all arguments are defined.

The intrinsic function def(x: T): bool can be used to test an arbitrary expression x of any type T for defined'ness. def(x) returns true if the value of x is defined; it returns false otherwise. Thus, def() acts like a guard against the propagation of undefined values.

Current implementation

Except for initialization expressions, return expressions, and expressions guarded with def(), by default the current implementation aborts with a run-time error if an undefined value is encountered. When the --ignore_undefs flag is specified with either the szl or saw application, all undefined values are propagated silently, without runtime errors.

Undefined initialization expressions don't cause a run-time trap so that it is possible to write code such as

x: T = expr;  # potentially undefined
if (def(x))
  # x is defined; it can be used w/o runtime trap
else
  # x is undefined; using it will cause a runtime trap
  # (unless used again in one of the exception cases described here)
without the need to re-evaluate expr if its value is defined.

Undefined return expressions don't cause a run-time trap so that it is possible to return an undefined value from a user-defined function; usually to indicate that the function failed. For instance, one might write a function f that returns a result of type T, and also use the result mechanism to indicate if the function was successful or not:

f := function(...): T {
  ...
  # return an undefined value to indicate failure of f
  u: T;
  return u;
};

result: T = f(...);
if (def(result))
  # function succeeded; result is defined
else
  # function failed; result is undefined
}

Types

A type specifies a (possibly infinite) set of values and an interface. Sawzall is a statically typed language, that is, the types of all variables and expressions are known at compile-time. The interface of a type determines which operations are legal on values of that type. The interface is implicitly defined for all the basic types. For composite types, the interface is explicitly defined via their structure.

Value semantics

All Sawzall types have value semantics; a variable always stores an entire copy of a value and not a reference to it. This is of particular importance for composite types such as arrays, maps, and tuples: If a value x held in a variable v1 is assigned to a second variable v2, changing v2 after the assignment - for instance by modifying components of v2 - won't affect the value of v1.

Implementation note

The current implementation uses a copy-on-write scheme to implement value semantics efficiently; a copy is made only when modifying a shared value.

Type specifications

A type specification may be preceded by the proto keyword. In that case, the type specification must be a tuple type (either a type name referring to a tuple type, or a tuple type specification). The resulting tuple type is an automatic proto tuple type generated from the original tuple type (see Automatic proto tuple types for details).

Type =
  type_name | ArrayType | MapType | TupleType |
  OutputType | FunctionType | 'proto' Type.
type_name = identifier.

Basic types

Basic types are intrinsically known to the system and have predeclared names. There are 7 basic types in Sawzall:
bool        # the boolean values true and false
bytes       # arrays of unsigned bytes
int         # 64-bit signed integer values
float       # 64-bit IEEE floating point values
time        # unsigned integral representations of time, with microsecond resolution
fingerprint # unsigned hash values computed by an implementation-defined hash function
string      # arrays of unicode characters
uint        # 64-bit unsigned integer valuse

Composite types

Composite types are composed of 0 or more components. Array, tuple, and map types are composite types. Each component may have an optional component name (documentation only), and each component has a component type.

Component = [component_name ':'] ComponentType.
component_name = identifier.
ComponentType = Type.

Array types

An array is a composite type consisting of an (unspecified) number of components, called elements, which are all of the same type. An array type specifies the type of the elements, the element type. It may also specify an element name, which serves documentary purposes only. The number of elements of an array is called its length. The length of an array is determined at execution time. The elements of an array are designated by indices, which are integer values between 0 and the length minus 1.

ArrayType = 'array' 'of' Element.
Element = Component.

Examples

array of int
array of names: string
array of array of point: Point

Tuple types

A tuple is a composite type consisting of a fixed number of members of possibly different types. Members may be components, which are called (simple or proto) fields. A tuple type specifies a type for each field, and possibly a name. If there is no name, the field is called anonymous. No two named fields within a tuple can have the same name. Members may also be type declarations or static variable declarations. A tuple type acts as a scope for these declarations.

The type of a tuple field, static declaration or type declaration may refer to the name of an enclosing type; that is, type declarations may be recursive. When used in a field or type declaration, the inner type reference must occur within an array type or map type enclosed by the outer type declaration. (That is, there must be a way to omit the value of the field. Otherwise it would be impossible to represent an object of that type.)

Although types may be recursive, objects can neither contain nor refer to enclosing objects or themselves (directly or indirectly). This is because Sawzall is value-based and has no pointers or references.

Types cannot be mutually recursive. Since Sawzall does not have forward references it is not possible to have two types (e.g. tuples) each of which makes use of the other.

Named fields only may also have an integer value called (field) tag associated with it. Tags are used to associate the field with a field in the protocol buffer representation of the tuple (see also Protocol Buffers). No two fields within a tuple can have the same tag, and all tags must be > 0. Tuples with tagged fields are called proto tuples; tuples without tagged fields are called simple tuples. All fields of proto tuples must have tags.

Tagged fields of proto tuples may also specify a constant default value. The default value is used to initialize a field during proto conversion if the corresponding field value is missing in the proto buffer being converted. The special intrinsic inproto(field: T): bool can be used to test whether a proto tuple field was present in the protocol buffer converted into the proto tuple (see also A Manual for the Sawzall Intrinsics).

Tagged fields of proto tuples may also specify a corresponding underlying protocol buffer field type. This is the field type the tuple field is converted into when the tuple is converted from a Sawzall tuple back into the protocol buffer representation. If no protocol buffer field type is specified, a default field type is assumed (see table below).

For historical reasons, a proto tuple type may start with the keyword parsedmessage to indicate that the tuple is a proto tuple. The keyword is treated as a comment.


TupleType = SimpleTupleType | ProtoTupleType.

SimpleTupleType = '{' [SimpleMemberList] '}'.
SimpleMemberList = SimpleMember {',' SimpleMember} [','].
SimpleMember = TypeDecl | StaticVarDecl | SimpleFieldDecl.
SimpleFieldDecl = Component.

ProtoTupleType = ['parsedmessage'] '{' [ProtoMemberList] '}'.
ProtoMemberList = ProtoMember {',' ProtoMember} [','].
ProtoMember = TypeDecl | StaticVarDecl | ProtoFieldDecl.
ProtoFieldDecl = Component ['=' ProtoFieldDefault] '@' proto_field_tag [':' proto_field_type].
ProtoFieldDefault = Expression.
proto_field_tag = integer.
proto_field_type = identifier.
The type declarations for proto tuples are usually automatically generated by the protocol-compiler from a .proto file. The following table shows the relationship between the Sawzall and protocol buffer field types:

Protocol buffer type Sawzall type Comments
bool bool default conversion
boolean bool
string bytes default conversion
fixed32 int may overflow
fixed64 fingerprint default conversion
fixed64 int default conversion
fixed64 time default conversion
int32 int may overflow
int64 int
uint64 uint default conversion
float float may overflow
double float default conversion

For each protocol buffer field type there is a corresponding Sawzall field type; for each Sawzall field type there may be more than one protocol buffer field type. Some Sawzall fields may lose information (overflow may occur) if they are converted into a protocol buffer field of a narrower type (note that all basic Sawzall types - except bytes and string - are 64 bits wide). The default conversions are used if no protocol buffer field type is specified explicitly.

Automatic proto tuple types

A proto tuple type can be obtained automatically by preceding a simple tuple type with the keyword proto. The resulting proto tuple type consists of the original tuple type converted into a parsedmessage proto tuple type augmented with field tags. The tags are assigned in increasing order, starting with tag value 1 for the first field, tag value 2 for the second field, and so forth. The same conversion is applied to any nested tuples in the tuple type (field tags start with tag value 1 again; see also the examples below). A proto keyword applied to a proto tuple type does not change its type (i.e., proto is idempotent).

Implementation restriction

The current implementation may only support a restricted set of constant expressions for field default values of proto tuples.

Examples

{}

{ x: float, y: float, int }

# the Vector tuple type
{ x: float, y: float,
  static Magnitude := function(p: Vector): float {
    return sqrt(p.x*p.x + p.y*p.y);
  }
}

{ ip: int = 0xffffff00 @ 1,
  value: bytes = bytes("britney") @ 2, # proto strings are Sawzall bytes
  timestamp: time @ 5,
  type Server = {
    id: int,
    location: string
  },
  static location1 := "ROB",
  static location2 := "GRI",
  static location3 := "BGI",
}

parsedmessage {
  g: array of TimeProtocol_G @ 1, # 11
  debug: array of bytes @ 4: bytes # 34
}

proto T  # T must be a tuple type

proto {
  x: int,
  y: float
}

proto proto proto {  # proto is idempotent
  x: int,
  t: {
    s: bytes,
    t: bytes
  }
}

parsedmessage {  # this type is equivalent to the previous one
  x: int @ 1,
  t: parsedmessage {
    s: bytes @ 1,
    t: bytes @ 2
  } @ 2
}

Map types

A map is a composite type consisting of an (unspecified) number of components, called key-value pairs, which are all of the same type. A map type specifies the key and value types. It may also specify key and value names, which serve documentary purposes only. The number of key-value pairs of a map is called its length. The length of a map is determined at execution time. The values of a map are designated by their keys, which are values of key type.

MapType = 'map' '[' Key ']' 'of' Value.
Key = Component.
Value = Component.

Examples

map [int] of bool
map [symbol: string] of int
map [point: {x: float, y: float}] of name: string

Output types

Output types specify language-external containers which aggregrate data in a type-specific fashion. Sawzall is extensible with respect to output types. The following output types are supported:
  1. collection: a simple collection or concatenation of the data
  2. maximum: a precise sample of the N highest-weighted data items.
  3. minimum: a precise sample of the N lowest-weighted data items.
  4. sample: a statistical sampling of N items.
  5. set: a set (unique elements) containing at most N items per index.
  6. sum: an arithmetic sum of the data.
  7. top: statistical estimators for the`most frequent N' data items.
  8. unique: statistical estimators for the total number of unique data items.
  9. weightedsample: a statistical sampling of N items, biased towards items with higher weights.
  10. recordio: output to a simple record-based binary file.
Each kind of output type specifies a particular aggregation method. Output types are used to declare output variables which represent the connection to the language external aggregators. emit statements are used to send data to an aggregator.

Output types may be parametrized and indexed. Parameters are used to set up the aggregation variable. Indices are used to create "arrays" of aggregators. Unlike with arrays, indices of various types (not just int) may be specified. Some output types require the specification of a weight type. When emitting to a variable of output type, the weight is used to scale the emitted value for aggregation.


OutputType =
  'table' table_type [Parameter] {Index} 'of' Element [Weight]
  [FileSpec | ProcSpec ] [FormatSpec].
table_type = identifier.
Parameter = '(' Expression ')'.
Index = '[' Component ']'.
Element = Component.
Weight = 'weight' Component.
FileSpec = 'file' '(' ArgumentList ')'.
ProcSpec = 'proc' '(' ArgumentList ')'.
FormatSpec = 'format' '(' ArgumentList ')'.
ArgumentList = ExprList.
ExprList = Expression {',' Expression }.
If a file or proc specifier is present, data is not emitted to an external aggregator, but to a file or process respectively, with the file or process name computed via the specifier as described below. Unless a format specifier is present as well, the element type must be bytes. file or proc specifiers are only allowed for collection output types.

If a format specifier is present, any value emitted is formatted into a string as expressed via the specifier. The emitted values are always of type string, independently of the element type specified in the output type; except if there is a file or proc specifier present as well, in which case the string is converted into (UTF-8 encoded) bytes before emission.

The file, proc, and format specifier format expressions similar to printf in C: The first argument must be a string possibly containing format characters. The remaining arguments must correspond in number and type to the format characters in the format string. Within the arguments of file or proc specifiers, output type index values may be referred to via the corresponding component name of a particular index. Within the arguments of format specifiers, the element value may be referred to via its component name.

Examples

# Type of intrinsic output variable stdout
table collection of x: string file("/dev/stdout") format("%s\n", x)

# Type used to collect all the values into a single stream
table collection of string;

# Type used to count the number of times each value is seen
table sum[value: string] of count: int;

# Type used to record the top 10 values for each category
table top(10)[category: string] of value: string weight count: int;

# Type used to record the ten most expensive operations
table maximum(10)[category: string] of operatikon: string weight cost: float;

# Type used to count how many unique values there are, using an (internal)
# sampled table of 10000 values to estimate the distribution
table unique(10000) of value: string;

Function types

A function type specifies the set of functions with a particular signature. The signature of a function is the (ordered) list of parameter types and the result type, if any.

FunctionType = 'function' '(' [ParameterList] ')' [ResultSpec].
ParameterList = Parameter {',' Parameter}.
Parameter = identifier ':' Type.
ResultSpec = ':' Type.

Examples

function()
function(n: int): int
function(name: string, hint: Coordinates): Coordinates

Equality of types

Structured types are equal if they have the same structure. Array types have the same structure if the element types are equal. Map types have the same structure if both the key and value types are equal. Tuple types have the same structure if they have the same number of fields and all fields match. Fields match if they both have a name and they are equal, or both don't have a name; and if they have equal types. Function types are equal if they specify the same number of parameters and corresponding (by position) parameter types are equal, and if their return types are equal (or missing everywhere).

Statements

Sawzall supports most C statements and they match C syntax closely, except for the switch statement which is more structured in Sawzall and has a slightly extended syntax. Additionally, Sawzall defines an emit statement for emission of data into output variables; and a when statement for the (possibly parallel) repeated execution of statements.

Statement =
  Assignment | Block | BreakStatement | ContinueStatement | DoStatement |
  EmitStatement | ExprStatement | ForStatement | IfStatement | ResultStatement |
  ReturnStatement | SwitchStatement | WhenStatement | WhileStatement.

Assignments

An assignment statement serves to replace the current value of a variable or a component of the variable on the left-hand side by a new value specified by the expression on the right-hand side. The type of the variable and the expression must be equal. No evaluation order is specified. In particular, a program is erroneous if the value of the right-hand side expression depends on the evaluation of the left-hand side variable and vice versa. For example, the assignment a[f()] = g(); is erroneous if calling f() before g() changes the result of g(), and vice versa.

Assignment = Factor '=' Expression ';'
If either the left-hand side or the right-hand side of the assignment is undefined, the variable designated by the left-hand side becomes entirely undefined. For example, in the assignment a[i] = x; the array a becomes undefined if either the index i or the value x is undefined.

Blocks

A block groups several declarations and statements into a single unit. A block can be used where only a single statement is expected. A block also introduces a new scope. Execution of a block causes sequential execution of the enclosed declarations and statements.

Block = '{' { Declaration | Statement } '}'.

Break statements

A break statement terminates the execution of the smallest enclosing for, while, or switch statement. break statements may not appear if there is no such enclosing statement or if there is a when statement between the break and the enclosing statement. Execution continues with the statement immediately following the terminated statement.

BreakStatement = 'break'.

Continue statements

A continue statement causes control to continue at the the end of the loop's substatement of the smallest enclosing do, for, or while loop. continue statements may not appear if there is no such enclosing statement or if there is a when statement between the continue and the enclosing statement.

ContinueStatement = 'continue'.

Do statements

A do statement specifies conditional repetition. The substatement is executed as long as the expression (including side-effects) evaluates to true; but at least once. The substatement execution and expression evaluation is repeated as long as the expression evaluates to true.

DoStatement = 'do' Statement 'while' '(' BoolExpression ')' ';'.
If the boolean expression has become undefined, the do statement is aborted and execution continues with the next statement.

Example

do
  d++;
while (d*d < x);

Emit statements

An emit statement specifies a value to be aggregated with a possibly indexed variable of output type. A weight may be specified for output types that require it. The indices, the value to be emitted, and the weight are evaluated, and possibly formatted as specified by the output type, and then transmitted to the language-external container specified by the variable. The evaluation order is not specified; in particular, a program that depends on a particular evaluation order is erroneous.

EmitStatement = 'emit' OutputDesignator '<-' Expression ['weight' Expression] ';'.
OutputDesignator = var_name {'[' Expression ']'}.

Examples

emit stdout <- "Hello World!";
emit valuecount[values[somej]] <- 1;

Expression statements

An expression statement is an expression executed for its side-effects. If the expression specifies a variable, array element, or tuple field of type int, and the expression is followed by an increment ++, or a decrement -- operator, the variable, array element, or tuple field is incremented or decremented by 1.

ExprStatement = Expression [ '++' | '--' ] ';'.
If an undefined value occurs during expression evaluation, expression evaluation is aborted and execution continues with the next statement.

Examples

i++;
a[i]--;
f(x, y);

For statements

A for statement specifies conditional repetition. The first expression is evaluated once and thus specifies initialization of the loop. The second expression must evaluate to a boolean value; it is evaluated before each iteration, and once it becomes false, the for loop is terminated. The third expression is evaluated after each iteration. Any of the expressions can be dropped. Omitting the second expression is equivalent to having the boolean constant true as second expression.

ForStatement =
  'for' '(' [ForDeclExpr] ';' [BoolExpression] ';' [ForDeclExpr] ')' Statement.
ForDeclExpr = Declaration | ExprStatement.
A for statement of the form:
for (expr1; cond; expr2) statement;
is syntactic sugar for the block:
{ expr1;
  while (cond) {
    statement;
    expr2;
  }
}

If statements

An if statement specifies the conditional execution of statements. If the boolean expression evaluates (including side-effects) to true, the first statement is executed. If the boolean expression evaluates to false, the 2nd statement (following the else clause) is executed, if present.

IfStatement = 'if' '(' BoolExpression ')' Statement ['else' Statement].
If the boolean expression is undefined, the if statement is ignored and execution continues with the next statement (following the if statement).

Result statements

A result statement produces a value for and exits the innermost statement expression containing the result statement. All the result statements within an statement expression must produce values of the same type. It is an error if program execution reaches the end of a statement expression without executing a result statement.

ResultStatement = 'result' Expression.
A result statement cannot be used inside a function without an enclosing statement expression.
e := ?{
  j := function(): int { result 1; };  # compile-time error
};

Return statements

A return statement exits the innermost function containing the return statement. If the function declaration specifies a return type, the return statement must specify an expression of the same type as that return type. It is a run-time error if program execution reaches the end of such a function without executing a return statement. If the return statement is not within a function, it terminates the program upon execution.

ReturnStatement = 'return' [ Expression ].
If the return value is undefined, the function returns an undefined result. A return statement cannot be used inside a statement expression without an enclosing function.
x := function(): int {
  n := ?{
    return 2;  # compile-time error
  };
};

Switch statements

A switch statement specifies the selection of a sequence of statements depending on a tag. Note that the Sawzall switch statement differs from a C switch statement in a couple of ways: syntactically, more than one case can be mentioned per case keyword; and semantically, the tag value is compared with individual case values in order of appearance. Individual case values need not be compile-time constants. In particular, the Sawzall switch statement works also for tag types other then int. Furthermore, an implicit break statement is appended to each statement list per case (no fall through).

SwitchStatement = 'switch' '(' Expression ')' '{' { Case } Default '}'.
Case = 'case' CaseLabelList ':' StatementList.
CaseLabelList = Expression {',' Expression}.
StatementList = Statement { Statement }.
Default = 'default' ':' StatementList.
A switch statement of the form:
switch (tag) {
  case a0, a1, ... an:
    as;
  case b0, b1, ... bn:
    bs;
  ...
  case z0, z1, ... zn:
    zs;
  default:
    s;
}
is syntactic sugar for the block (assuming no undefined values and break statements):
{ $tag: typeof(tag) = tag;
  if ($tag == a0 || $tag == a1 || ... $tag == an)
    as;
  else if ($tag == b0 || $tag == b1 || ... $tag == bn)
    bs;
  ...
  else if ($tag == z0 || $tag == z1 || ... $tag == zn)
    zs;
  else
    s;
}
If tag is undefined, the switch statement is ignored, and execution continues with the next statement. If any of the case labels is undefined, the effect is as if that case were not mentioned. Since this can lead to a potentially 'empty' switch statement, the default case is mandatory. Break statements can be used to exit a switch statement before the end of a statement list in a case (breaks are implicit at the end of each case).

Implementation restriction

The current implementation only supports tag values of basic types (including string, and bytes).

When statements

A when statement specifies the (potentially parallel) repeated execution of a statement for a set of quantifier values satisfying a condition. A quantifier variable is a variable declared within the when statement that is then evaluated within the condition. The quantifier variable assumes (is bound to) values that make the condition true. There are 3 kinds of quantifiers: The scope of quantifiers is the when statement. If there is more than one quantifier, the first quantifier is bound first, the 2nd quantifier is bound second, etc.; i.e., the set of possible values for a 2nd quantifier may depend on the bound value of the 1st quantifier. In particular, swapping the order of quantifiers may change the semantics of the when statement.

WhenStatement = 'when' '(' {QuantifierDecl} BoolExpression ')' Statement.
QuantifierDecl = var_name ':' quantifier Type ';'.
quantifier = 'all' | 'each' | 'some'.
If the boolean expression is (or has become) undefined for all possible values of the quantifiers, the when statement is aborted and execution continues with the next statement.

break or continue statements are not permitted in a when statement unless the corresponding switch statement or loop is also in the when statement.

Implementation restriction

The current implementation requires that quantifiers are used as stand-alone indices (i.e., not as part of a more complex expression) in array or map subscripts. The arrays and maps subscripted by the quantifiers are used to determine the range of quantifier values. If such an array or map changes during the evaluation of the when statement, the results are unspecified.

Example

# Count the relevant values
when (
  somei: some int;
  somej: some int;
  lowercase(valuewords[somei]) == values[somej]
)
  emit valuecount[values[somej]] <- 1;

While statements

A while statement specifies conditional repetition. If the boolean expression (including side-effects) evaluates to true, the substatement is executed. The expression evaluation and substatement execution is repeated as long as the expression evaluates to true.

WhileStatement = 'while' '(' BoolExpression ')' Statement.
If the boolean expression is (or has become) undefined, the while statement is aborted and execution continues with the next statement.

Example

while (d*d < x)
  d++;

Expressions

Expressions

Expressions specify how values are computed by applying operators on factors. The syntax of expressions distinguishes between 5 classes of binary operands with different precedences (binding strengths) and 1 class of unary operators with highest precedence. Operators of the same precedence associate from left to right; for example, x - y - z stands for (x - y) - z.

Expression = Conjunction {('||' | 'or') Conjunction}.
Conjunction = Comparison {('&&' | 'and') Comparison}.
Comparison = SimpleExpr [relation SimpleExpr].
relation = '==' | '!=' | '<' | '<=' | '>' | '>='.
SimpleExpr = Term {add_operator Term}.
add_operator = '+' | '-' | '|' | '^'.
Term = Factor {mul_operator Factor}.
mul_operator = '*' | '/' | '%' | '<<' | '>>' | '&'.

Factors

Factors are operands, possibly followed by a selector, index expression, or invoked as function call.

Factor = Operand { Selector | Index | Call | SawCall }.
Selector = '.' field_name.
field_name = identifier.
Index = '[' Expression [':' Expression] ']'.
Call = '(' ArgumentList ')'.
SawCall =
  '(' [IterationCount ','] InputString
  ',' RegexpList [',' 'rest' Factor] ')'.
IterationCount = Expression.
InputString = Expression.
saw_name = identifier.
RegexpList = Regexp {',' Regexp}.
Regexp = ['skip'] ['submatch'] string.

Member selection

If the operand is a type identifier T, then T.m selects the tuple member m, and m must refer to the type name of a type declaration, or the variable name of a static variable declaration in the tuple type T. If the type of an operand T is a tuple, then T.f selects the field f of the tuple.

Array indexing

If the type of an operand A is an array, string, or bytes, then A[i] selects the element of A whose index corresponds to the current value of i. If the index i of A[i] is not within the bounds of the array, the expression A[i] becomes undefined. Furthermore, A[i:j] selects a sub-array or slice consisting of the elements A[m], A[m+1], ... A[n-1] with m = max(0, i), and n = min(len(A), j). Note that slices are never undefined, they simply become empty (i.e., their length is 0) if the range is entirely outside the array boundaries.

Map indexing

If the type of an operand M is a map, then M[k] selects the element of M whose key corresponds to the current value of k. If M[k] appears on the right-hand side of an assignment or in an expression, and there is no map element for the key k, then M[k] is undefined. If M[k] is the left-hand size of an assignment of the form M[k] = v;, then v becomes the value associated with the key k.

Function calls

If the type of an operand F is a function, then F(p1, p2, ... pn) expresses the call of F with parameters p1, p2, ... pn. The parameter expressions are evaluated and assigned to the formal function parameters before the call; no evaluation order is specified. In particular, a program that depends on a particular evaluation order of parameters is erroneous. If F is defined by the Sawzall program (i.e., it is not an intrinsic function), the number and types of the parameter expressions must correspond (by position) with the type of F. If F is an intrinsic function, function specific rules apply. In particular, if F specifies a saw call (one of saw, sawn, or sawzall), saw-specific syntax applies to the function call. If the operand F is a type name, then the call F(x, arg1, arg2, ...) is syntactic sugar for the call convert(F, x, arg1, arg2, ...). Note that parameter passing is simply a special form of assignment; and thus is done by value in Sawzall.

Conversions

If an operand T is an identifier designating a type, then T(x, p1, ... pn) is a shortcut for the conversion function call convert(T, x, p1, ... pn), where convert is a special conversion intrinsic, T the type into which the value x is to be converted, and p1, p2 ... pn, are (potentially missing) parameters specifying the conversion more precisely. For example, int(x, 10) is the same as convert(int, x, 10). Conversions are defined between most basic types, and between some array types as well. Currently a detailed list of all implemented conversions is missing, but some additional information can be found in Sawzall Language.

Operands

Operands are type or variable identifiers, literal values, composite values, anonymous functions (closures), unary operators applied to other factors, $ symbols, nested expressions, or statement expressions. A type identifier denotes the corresponding type. It does not have a value and must be followed by a member selector, conversion call, or may be used as an argument for certain special intrinsics. A variable identifier stands for the current value of the variable. A literal value specifies a constant value. A $ symbol must only be used within an index expression and it stands for current length of the subscripted array, string, or bytes value. Parentheses are used to specify different evaluation orders than defined by the operator precedences.

Operand =
  identifier | literal | Composite | Function | unary_operator Factor |
  '$' | StatementExpr | '(' Expression ')'.
unary_operator = '+' | '-' | '~' | '!' | 'not'.

Composites

A composite specifies an tuple, array, or map "literal". The value of the composite is computed by evaluating all the expressions within the composite. The evaluation order is not specified. In particular, a program that depends on a particular evaluation order is erroneous. Composite types are target-typed, i.e., their type depends not only on the types of the individual composite expressions, but also where the composite occurs in the program:

Composite = '{' [ExprList | PairList | ':'] '}'.
PairList = Pair {',' Pair}.
Pair = Expression ':' Expression.

If any expression in the composite is undefined, the value of the composite is undefined.

Examples

{}
{:}
{ 'a', 'b', 127 }
{ 12, "hello", 3.14, {} }
{ "foo" : 3, "bar" : 7 }

Functions

A function consists of a function type and a block. The function type specifies the parameters and result type, if any. The block contains declarations and statements. The value of a function also binds the function's context, i.e., within the function, variables defined in scopes (usually other functions) surrounding the function are accessible. The function type may be explicitly specified or be the type name of a function type.

Function = Type Block.
A function that is used as initialization expression for a static variable may not access non-static variables in surrounding scopes. Thus, such a function cannot produce any side-effects and its result depends only on the (global) state of static variables and the function arguments.

Examples

function() {}
function(radius: float): float { return 2.0 * PI * radius; }
Action { kill_bill(); }

Intrinsic functions

The Sawzall implementation predefines a set of intrinsic functions; i.e., functions built into the system. Intrinsic functions may support variable length parameter lists (e.g., format()), optional parameters (e.g., assert()), and polymorphic parameter types (e.g., min()). The intrinsic functions can be roughly grouped into the following classes: Because many intrinsics are highly specific and may change over time, these functions are not described in this document in detail. Instead, the reader is referred to the Manual for Sawzall Intrinsics document.

Statement Expressions

A statement expression produces a value from the execution of an arbitrary sequence of statements, embedded in an in-line block. The value of a statement expression is the value of the first result statement executed during its evaluation.

StatementExpr = '?' Block .

Examples

# signum
signum := ?{ switch(true) { case x < 0: result -1; case x > 0: result 1; default: result 0; };

# array filtering
a = ?{
  aa: array of int = {};
  when (i: each int; Test(a[i])) aa = aa + {a[i]};
  result aa;
};
The block in a statement expression must contain at least one result statement, which determines the type of the statement expression.

Arithmetic operators

symbol	result
+	sum
-	difference
*	product
/	quotient
%	modulo
&	bitwise and
|	bitwise or
^	bitwise xor
~	bitwise negation
<<	shift left
>>	shift right (logical)
The behavior of shifts is unspecified if the shift count is greater than or equal to the word size. If an expression contains an integer division or modulo operation by 0, the expression becomes undefined. The operators / and % are related by the following formula:
	x  =  (x / y) * y  +  (x % y)
The operator ^ is related to the operators &, |, and ~ as follows:
	x ^ y = (x & ~y) | (y & ~x)

Comparison operators

symbol  comparison
==	equal
!=	not equal
<	less
<=	less or equal
>	greater
>=	greater or equal

Boolean operators

and	true if and only if both arguments are true; false otherwise (deprecated)
or	false if and only if both arguments are false; true otherwise (deprecated)
not	true if operand is false; false otherwise
&&	like and, but evaluates rhs only if lhs is true
||	like or, but evaluates rhs only if lhs is false
!	same as not
Note that and and or evaluate both arguments unconditionally.

Array operators

symbol	result
+	concatenation

Operator Precedences

Operator precedences and associativities are summarized in the table below.

Precedence Category Operators Associativity
6 (highest) unary
+   -   ~   !   not
right to left
5 multiplicative
*   /   %   <<   >>   &
left to right
4 additive
+   -   |   ^
left to right
3 comparison
==   !=   <   <=   >   >=
left to right
2 conjuction
&&   and
left to right
1 (lowest) disjunction
||   or
left to right

Examples for expressions

2003				# an int
i + 1				# an int
0b100 << i			# an int
(x - a) * (x + a)		# a float
{"the", "answer", "is", 42}	# a 4-tuple {string, string, string, int}
{1.2, 2.3, 3.4, 4.5, 5.6}	# a 5-tuple and also an array of float holding 5 elements
a[i % $]			# (i % len(a)).th element of array a
x.f				# field f of tuple x
{1, 2, 3, 4, 5}[i]		# i.th element of constructed array
(0 <= i) and (i < n)		# true if i in [0, n)
min(x, y)			# call of function min with arguments x and y
a[i:j]				# the array (a[i], a[i+1], .. a[j-1])

Lexical structure

The Sawzall language is the infinite set of sentences produced by the language syntax. Each sentence is called a Sawzall program. A program consists of a finite sequence of terminal (lexical) symbols. These symbols in turn are composed of finite sequences of characters. In the Sawzall language, terminal symbols can be classified into identifiers, literals, operators, and delimiters. Arbitrary amounts of white space, comments, and include clauses may appear between terminal symbols.

Encodings

The program text is in the Unicode character set encoded using UTF-8 (which is fully upwards compatible with 7-bit ASCII). String literals, character literals, and comments may therefore contain Unicode characters.

Comments

Comments start with a # and end at the end of a line. They may appear wherever white space may appear. Comments are treated like white space, and thus have no impact on the meaning of a program.

Examples

# Copyright 2010
################

Include clauses

An include clause can be used to include the contents of another file. The effect is as if the include clause was replaced by the contents of the included file surrounded by white space (i.e., a symbol at the end of an included file cannot "continue" in the file containing the include clause).

Include = 'include' file_name.
file_name = string.

Platform-specific rules

Under Linux and OS X, if file_name specifies a relative path (i.e., does not start with '/'), the directory location of the file is relative to the including file (i.e., the file containing the include clause), or relative to any number of directories specified as include directories via command line arguments; unless file_name starts with a '.' in which case it must be located relative to the including file. If other platforms are supported they do not necessarily follow this rule.

Examples

include "/home/szluser/stuff"
include "lib/qsort.szl"

Proto clauses

A proto clause is similar to the include clause except that the file that is included is generated from the protocol buffer description mentioned in the proto clause. The file is generated by the protocol compiler using the --lang=sawzall command line flag.

Proto = 'proto' file_name.

Platform-specific rules

The same platform-specific rules apply as for include clauses.

Example

proto "/home/szluser/logfile.proto"

Terminal symbols

Identifiers

An identifier is a sequence of letters and digits. The first character must be a letter; the underscore '_' counts as a letter. Upper and lower case are distinct. Identifiers may have any length. Two identifers are equal if they have the same length and if they match character for character.

identifier = (letter | '_') {letter | '_' | digit}.
letter = 'A' .. 'Z' | 'a' .. 'z'.
digit = '0' .. '9'.
Various identifiers are predefined. They refer to basic data types, intrinsic functions, incoming parameters, etc. The following is an incomplete list:
int float string time bytes bool fingerprint output stdout stderr _undef_cnt
_undef_details true false PI SECOND SEC MINUTE MIN HOUR HR DEBUG def
convert new regex saw sawn sawzall fingerprintof format len haskey inproto
keys lookup max min assert addday addmonth addweek addyear dayofmonth
dayofweek dayofyear hourof minuteof monthof secondof yearof trunctoday
trunctohour trunctominute trunctomonth trunctosecond trunctoyear now highbit
load getenv lowercase uppercase match strfind strrfind strreplace matchposns
matchstrs ln log10 exp pow ___raise_segv rand nrand tobase64 frombase64
splitcsvline splitcsv

Platform-specific predefined identifiers

A list of all predefined identifiers can be obtained with the command: szl --explain=

Literals

Literals are symbols denoting constant values. There are integer, unsigned integer, fingerprint, floating point, char, string, and time literals. The literal syntax follows ANSI C literal syntax closely where possible.

literal = integer | unsigned_integer |
          fingerprint | floating_point | char | string | time.

Integer literals

An integer literal can be specified in binary, octal, decimal, or hexadecimal form. Integers may be signed.


integer = ['-'] (binary_int | octal_int | decimal_int | hexadecimal_int).

binary_int = '0' ('B' | 'b') bin_digit {bin_digit}.
octal_int = '0' oct_digit {oct_digit}.
decimal_int = dec_digit {dec_digit}.
hexadecimal_int = '0' ('X' | 'x') hex_digit {hex_digit}.

bin_digit = '0' .. '1'.
oct_digit = '0' .. '7'.
dec_digit = '0' .. '9'.
hex_digit = '0' .. '9' | 'A' .. 'F' | 'a' .. 'f'.

Examples

1		# = 1
011		# = 9
0b1010		# = 10
42		# = 42
0x7f		# = 127

Implementation note

The syntax for operands permits a unary operator - in front of an expression; thus a negative integer value can be written in two different ways. For instance, -3.14 can be written as -3.14 (signed integer literal) or as - 3.14 (- operator followed by unsigned integer literal). This apparent redundancy is required to permit the entry of the most negative integer value -0x8000000000000000 as a literal (the corresponding positive value 0x8000000000000000 is too large to be represented).

Unsigned integer literals

An unsigned integer literal looks like a regular integer literal except that it is not signed and must have a 'U' or 'u' suffix. An unsigned integer literal has type uint, not int.


unsigned_integer = (binary_int | octal_int | decimal_int | hexadecimal_int) ('U' | 'u').

Examples

1u		# = 1
011u		# = 9
0b1010U		# = 10
42u		# = 42
0x7fU		# = 127
18446744073709551616U	# = largest possible 64-bit value

Fingerprint literals

A fingerprint literal consists of an integer literal immediately followed by a 'P' or 'p'. The value of the fingerprint is the value of the integer.

fingerprint = integer ('P' | 'p').

Examples

0p		# the fingerprint with value 0
0x6347P		# the fingerprint with value 25415

Floating point literals

A floating point literal consists of an integer part, a decimal point, a fraction part, and an optional scale factor. The scale factor consists of an 'E' or 'e' followed by a possibly signed exponent. Either the integer or the fraction part (not both) may be missing; either the decimal point or the scale factor (not both), may be missing. Floating point numbers may be signed.

floating_point = ['-'] [integer_part]['.' [fraction_part]] [scale_factor].

integer_part = decimal_int.
fraction_part = decimal_int.
scale_factor = ('E' | 'e') ['+'|'-'] exponent.
exponent = decimal_int.

Examples

.01		# = 0.01
1e-3		# = 0.001
2.		# = 2.0
3.1415		# = 3.1415
2.18E5		# = 218000.0

Implementation note

A negative floating point value can be written in two different ways: As a negative floating point literal or as a positive floating point literal preceded by a minus sign (see also the implementation note on integer literals). This syntactic redundancy has no effect on program semantics and is simply a side effect of the current implementation (which shares code for the lexical analysis of integer and floating point literals).

Char literals

A char literal consist of an ordinary (Unicode) character or an escaped character, enclosed in single quotes '. The value of an ordinary character is the integer corresponding to the Unicode code point of the character. The value of an escaped character matches the C definition.

char = ''' character '''.
character = ordinary_character | escaped_character.
escaped_character =
  '\' oct_digit [oct_digit [oct_digit]] |
  '\' 'x' hex_digit {hex_digit} |
  '\' 'u' hex_digit hex_digit hex_digit hex_digit |
  '\' 'U' hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit |
  '\' ('a' | 'b' | 'f' | 'n' | 'r' | 't' | 'v' | '\' | '"') |
  '\' ordinary_character.
An ordinary character is any printable Unicode character except double quote " or backslash \. The value of an ordinary character prefixed with '\' is the value of the ordinary character (e.g., '\c' == 'c').

Examples

'\n'		# = 10
'A'		# = 65

String literals

A string literal is a sequence of characters enclosed in double quotes " or back quotes `. The value of a string literal is an array of integers with the element values corresponding to the Unicode values of the string's characters. If the string is in double quotes, escaped characters are interpreted as for char literals. If the string is in back quotes, no escaping takes place. This is particularly useful for regular expressions.

string = '"' {character} '"' | '`' {character} '`'.
The above syntax for strings is somewhat simplified to facilitate reading. In particular, it doesn't exclude " or ' characters from the respective string definitions. To include a double quote character in a string enclosed by double quotes, escape the double quote character: "\"". Back quotes cannot be included directly in a string enclosed by back quotes since they cannot be escaped in such a string by definition. Instead, use something like: `string1` + "`" + `string2`. Also, it is illegal for to place a character with value 0 in a string literal.

Examples

""		# empty string
"Sawzall\n"	# Sawzall followed by a newline character
`"Hello"`       # "Hello" (including the "'s)

Bytes literals

There are two forms of bytes literals. The first comprises a string literal immediately prefixed by the letter 'B'. The value of such a bytes literal is the sequence of bytes generated by interpreting each character in the string as a single byte. It is erroneous if a character in the string literal has value larger than 255. Thus B"\xff" contains one byte (value 255) while the regular string "\xff" contains one character whose UTF-8 representation would require two bytes. The other form is a string literal prefixed by the letter 'X'. In this form the string must be an even number of hexadecimal digits. Each successive pair of digits represents one byte of the value. Note: it is illegal for a string constant to represent a character with value zero, but a bytes literal may contain bytes with value zero.

bytes = 'B' string | 'X' hex_string.
hex_string = '"' { hex_digit hex_digit } '"' .

Examples

B""         # empty bytes literal
B"hello"    # five bytes
B`"`        # one byte, a double quote
B"\xff"     # one byte, value 255
X"0011AB"   # three bytes: 0x00 0x11 0xAB

Time literals

A time literal consists of an unsigned integer literal immediately followed by a 'T' or 't'. The value of the time literal is the value of the integer interpreted as microseconds. Alternatively, a date and time string in double quotes immediately preceded by a 'T' can be used to specify a time. The value of such a time string literal is the number of microseconds from Dec 31 16:00:00 PST 1969.

time = integer ('T' | 't') | 'T' string.

Examples

0t		# 0 microseconds
1000000T	# 1 second
T"Tue Jun  5 10:43:07 America/Los_Angeles 2007"
T"Wed Feb  4 16:26:41 PST 2004"

Sawzall's time zone support is built upon the ICU C++ code. You can use any time zone recognized by that code. "America/Los_Angeles"-style identifiers are encouraged.

Operators and delimiters

Operators and delimiters are special character sequences and reserved words. Reserved words cannot be used as identifiers (with the exception of format which is also the identifier for an intrinsic function).
+	-	*	/	%	&	|	^
<<	>>	==	!=	<	<=	>	>=
(	)	[	]	{	}	&&	||
~	!	.	,	;	:	=	<-
$	--	++

all	and	array	break	case	continue	default	do	each
else	emit	file	for	format	function	if	include	map
not	of	or	parsedmessage	proc	proto	rest	result	return
skip	some	static	submatch	switch	table	type	weight	when	while

Appendix

EBNF Syntax

The syntax of EBNF can be described within EBNF:

Production = production_name '=' Expression '.'.
Expression = Term { '|' Term }.
Term = Factor { Factor }.
Factor = production_name | symbol | Group | Option | Repetition.
Group = '(' Expression ')'.
Option = '[' Expression ']'.
Repetition = '{' Expression '}'.
production_name = identifier.

Acknowledgments

We have freely borrowed language and syntactical structure from the following sources: