Sawzall

Introduction

Given large amounts of data that must be analyzed either as it is generated or by examining historical data, there is a need for an efficient process for doing the analysis and an appropriate language for expressing the programs that perform the analysis. Sawzall, the language described informally by this document, is a notation for expressing such programs. This document describes the language; a formal syntax is provided in the Sawzall Specification document. Finally, and perhaps most usefully, there is an academic paper that discusses the motivation, design, and implementation.

Overview

One can model the Sawzall language as a kind of specialized distributed Awk. It is a pattern/action language: records are matched against a set of patterns; for each successful match, an associated action is performed. But unlike Awk, which has a restricted pattern matching language coupled to a general-purpose action language, Sawzall has only one notation for expressing the matching algorithms and the resulting actions. That notation is a fairly traditional procedural programming language.

The input to the language will usually be transactional data such as logs. Output from the language will be usually eventually be stored in a database that aggregates and stores statistics. Output is defined in terms of data structures that, like Awk associative arrays or Python dictionaries, synthesize new entries as new indices are mentioned. The language also supports some unusual forms of sampling data collectors to facilitate special functions such as creating `Top 10' lists and counting the number of unique events.

A motivation for the design is efficiency, particularly in the distributed case. If the data is scattered across many computers in a distributed file system, it should be possible to distribute the processing across many machines simultaneously for enormous speedups relative to a single machine scanning the data.

This goal of parallelism leads to several unusual properties of the language:

The rest of this document describes the elements of the language in more detail.

Detailed Design

This is not a formal definition of the language, but an informal guide to serve as background for discussion of the design.

Lexical conventions

Input is free-form, with semicolons used to terminate statements, as in C. Source code is in UTF-8.

Comments are introduced by a # character and continue until end of line.

An include clause is replaced by the contents of the file it names, as in:

include "/home/szluser/stuff"

A proto clause is similar to an include clause but is replaced by the translation into Sawzall of the protocol buffer specification in the named file, as in:

proto "logdata.proto"

Identifiers are as in C: alphanumeric with underscores.

Numbers, strings, and other constants are a subset of those in C: 23, 0x7f, 1.23e-4, "hello\n", 'x', etc. The only signedness modifier for integer constants is a trailing u or U to indicate an unsigned integer (uint) constant (examples: 0u, 0x7FFU). Otherwise integers are always signed, characters are always positive ints, and the sizes are big enough for most practical purposes. String and character constants are interpreted as representations of Unicode characters and may contain C-like character escapes, including \u1234 to represent Unicode code point U+1234 and \U00101234 for code point U+101234.

A special syntax for strings is convenient when writing regular expressions. Since regular expressions often contain backslashes, writing them in C notation can be clumsy; one needs to double all the backslashes. The pattern \s*\+\d+ (optional white space, literal plus sign, digits) must be written "\\s*\\+\\d+" and so on. The special syntax uses back quotes instead of double quotes; within a back quoted string, backslashes are uninterpreted. Our example regular expression is cleaner; it's just the regular expression in backquotes: `\s*\+\d+`. (If you need a backquote in such a string, you can use string addition (described later) to create it. For instance, `abc` + "`" + `def` is the same string as "abc`def".) Note that `\t` is a two-character string (backslash, t) while "\t" is a one-character string (tab). This notation still interprets the string as UTF-8; the only difference is the handling of backslash.

No string or character constant can straddle a newline and string concatenation is not implicit as it is in C and C++; use the + operator to concatenate strings.

Basic types

There are several basic types used throughout the language: bool, int, uint, float, bytes, string, time, and fingerprint.

The bool type is a boolean value. The bool constants true and false are predefined.

The int type is a 64-bit signed integer, while the uint type is a 64-bit unsigned integer. Although it might make sense to have various sizes of integers, integer processing is expected not to dominate the run-time of the language and it seems preferable to keep things simple.

The float type is a 64-bit IEEE floating point number. Again, in the interest of simplicity, there is only one floating-point type.

The string type stores an array of Unicode characters of unspecified representation. Indexing operations on the array will access individual characters and treat them as positive ints.

The bytes type stores an array of unsigned bytes. Indexing operations on the array will access individual bytes and treat them as ints. Bytes are always unsigned; setting an element of a bytes array to the value -1 will silently yield a byte with value 255. There are two types of bytes literal. The first is written as a string literal prefixed by a capital B: B"hello"; the individual characters represented in such a literal must each have value below 256. The second is written as a string literal prefixed by a capital X, with the string comprising pairs of hexadecimal digits representing the successive bytes of the constant: X"0102AF" holds the three bytes 0x01, 0x02, 0xAF.

The time type stores an integral representation of time, with microsecond resolution. It is separated from the int type to permit some special-purpose operations. A time literal is written as an integer literal with an (upper case or lower case) T appended, like this: 0T. Also, a valid date, written as a string with a T prefixed to the string, is also a way to specify a string literal, like this:

T"Mon Feb  9 14:55:44.5 PST 2004"

The fingerprint type stores a hash value computed with an implementation-defined hash function. A fingerprint literal is written as an integer literal with an (upper case or lower case) P appended, like this: 0P.

The time and fingerprint types also implement some integer operations but retain their different type under such operations.

Compound types

Through various mechanisms, one can derive compound types analogous to structures and arrays in C. The way these objects are derived and the ways they can be (and cannot be) used are somewhat different from C, however, as will become apparent.

Tuples

The Sawzall analog to a C struct is called a tuple. A later section describes how to declare a tuple, but for now just consider a tuple to be a compound data structure with fields addressable using the dot notation a.b, just as in C:
t.an_int = 1;
t.a_float = 22./7.;
(The arrow notation p->b does not appear).

Unlike in C, tuple expressions are common in Sawzall. For instance, a brace-enclosed, comma separated list of expressions represents a tuple value that may be treated as a unit. For instance, given the tuple t mentioned above, which contains an int and a float, one may assign to t like this:

t = { 1, 22./7. };
We will explore tuples in more detail later in the document.

Arrays

In expressions, the syntax of arrays is identical to that in C. (The declaration syntax, described in a later section, is different.) Arrays are always indexed by integers starting from 0, just like in C. Arrays are indexed using square brackets [] while multi-dimensional arrays are indexed like this: [][]. For example, if a is an array of integers and aa is a two-dimensional array of integers, one may write these statements:
a[0] = 10;
aa[1] = a;
aa[1][2] = a[1];

Arrays may be initialized by assignment to an elementwise-compatible tuple expression, like this:

a = { 1<<0, 1<<1, 1<<2 };
Arrays cannot be resized dynamically, but after creation a subsequent assignment may change its size (and replace all its elements). After the previous assignment, one may reassign the array:
a = { 1, 2, 3, 4, 5 };
Moreover, the subarrays of a multidimensional array may have different sizes, including zero:
aa[0] = { };
aa[1] = { 0 };
aa[2] = { 0, 1 };
aa[3] = { 0, 1, 2 };

Another way to control the size of an array is to use the addition operator, which concatenates two arrays; the following statements all produce the same result:

a = { 1, 2, 3, 4, 5 };
a = { 1, 2, 3 } + { 4, 5 };
a = { 1 } + { 2 } + { 3 } + { 4 } + { 5 };

By analogy with Python, arrays may be `sliced' using the array[min:max] notation, which represents a subarray from index min through max-1 inclusive. Within an index expression or slice, the character $ represents the size of the array at that dimension. Thus a[0:$] represents the entire array while a[$-1] represents the final element.

One can insert subarrays of different size using slice assignment, that is, an assignment with a slice on the left hand side, like this:

a[2:4] = { 11, 12, 13, 14, 15 ,16, 17 }

For arrays, byte arrays, and strings, the function len(array) yields the number of elements of the array, which may be zero. For byte arrays, len will recover the number of bytes in the byte array. For strings, len will recover the number of Unicode characters, not the number of bytes in a byte-level representation of those characters.

Maps

Maps are analogous to dictionaries in Python or associative arrays in Awk. They are a way to store key-value pairs: store a value under a key, retrieve it later by indexing the map with the key. Unlike arrays, maps can be indexed by arbitrary types, not just integers, the indices can be sparse, and they are allocated dynamically so they may grow as you insert elements. Like arrays, however, the key and the value of a given map are each of a single type, such as a map from strings to integers or from floats to arrays of strings.

Map literals are created with a special notation that identifies the key-value pairs. Each element of the map literal contains the key, a colon, and a value, like this example mapping strings to integers:

numbers = { "one": 1, "two": 2, "three": 3, "four": 4 }
As a special case, one can create a map with nothing in it (yet) using the special notation:
numbers = {:};
The colon between the braces distinguishes a map literal from an array literal.

Syntactically, maps look like arrays: the key-value pairs are accessed using indexing operations with square brackets []. (There is no slicing operation for maps.) The numbers map is indexed like this:

numbers["one"]
numbers["two"]
and so on. Indexing a key that is not present in the map results in an `undefined' value, which is described in more detail in the section on Failures and default values.

The function len(map) yields the number of key-value pairs present.

Basic operations

There are a number of built-in operations available for various types in the language; this set will evolve.

Numeric types have the usual C-like operations.

Arrays, including the string and bytes types, overload addition to mean concatenation.

The fingerprint type overloads addition to mean concatenation, that is, the mixing of two fingerprints to yield a new fingerprint.

Conversions

There are also a number of conversions to transform objects from one type to another. The most general conversion is of the form convert(type, value, parameters). The case-insensitive parameters, which may be omitted, vary with the actual conversion and represent things like the base for integer-string conversion and the character encoding for bytes-string conversion. They are described in more detail below. For instance, convert(string, num_events, 16) converts the integer num_events to its hexadecimal representation. If the type is a single word, such as a basic type or defined type name, the syntax may be compressed by using the type name as the `function', as in string("800e4", 16) or time("Wed Jul 30 23:49:55 PDT 2003"). Note these are conversions, not casts: they can change the value, not just the type.

Strings can be converted to and from byte arrays. By default, this operation interprets the byte array as a UTF-8 representation of the string being converted. That is, s = string(bytesarray) stores in s the character string represented by the UTF-8 bytes in bytesarray, while b = bytes(s) converts it back.

Some of the conversion operators accept extra parameters. The numeric conversions accept a base: i = int("123", 16) or s = string(i, 16). A missing base defaults to 0, which, as in strtol, indicates that the string should be interpreted like string constants in C: a leading 0 means octal, 0x means hexadecimal. To avoid surprises, it's best to specify a base, even if it's 0; using 0 explicitly indicates that the choice is deliberate.

Conversion operators between strings and byte arrays accept a string that identifies the character encoding to use (default UTF-8). Supported parameters include "UTF-8", "latin-1", "array-literal", "hex", and "unicode". For example s = string(bytesvar, "latin-1") interprets bytesvar as a Latin-1 sequence and converts it to a Unicode string while b = bytes(str, "latin-1") inverts that transformation.

Similarly, the conversion parameter "unicode" allows one to convert an array of integers representing Unicode code points to a string, and vice versa.

s: string = convert(string, { 0x65e5, 0x672c, 0x8a9e }, "unicode");
# s is now "日本語"
a: array of int = convert(array of int, s, "unicode");
# a is now { 0x65e5, 0x672c, 0x8a9e }

Other bytes-to-string (and reverse) conversions include "hex" which converts the bytes to a single hexadecimal string suitable for pickling binary data and "array-literal" which turns a string into a Sawzall expression for an array literal whose elements are the individual bytes of the input. This program,

b = bytes("fooe\u00FF");  # FF is y-umlaut
a = string(b, "array-literal");
stores in a the string
"{ 0x66, 0x6f, 0x6f, 0x65, 0xc3, 0xbf }"
(Note the UTF-8 encoding in the last two bytes.)

Integers, both signed and unsigned, can also be converted to and from byte arrays. The conversion takes a required parameter specifying how the integer values are encoded into bytes. Supported parameters include "fixed32-big", "fixed32-little", "fixed64-big", "fixed64-little", "saw", "varint", and "zigzag".

The conversion parameters "fixed32-big", "fixed32-little", "fixed64-big", and "fixed64-little" encode 32- or 64-bit integers into 4 or 8 bytes, respectively, using a big or little-endian encoding. The parameter "saw" is an alias for "fixed64-big" (64-bit big-endian encoding), and is the format saw uses for storing integers in sstable output tables.

The "varint" encoding is a variable-length encoding also used by protocol buffers for storing integer values. The "zigzag" encoding is a variant of the varint encoding which encodes negative values more efficiently. It is used in protocol buffer version 2.

i = convert(int, B"\x00\x00\x00\x01", "fixed32-big");
# i is now 1
b = convert(bytes, 2, "fixed64-little");
# b is now B"\x02\x00\x00\x00\x00\x00\x00\x00"
v = convert(int, B"\x80\x01", "varint");
# v is now 128

Fingerprints act like integers in conversion, but yield unsigned 64-bit numbers. Also, they may be converted to and from bytes, in a way that yields a packed 8-byte representation of a 64-bit integer, suitable for loading and unloading string variables in protocol buffers.

Not all conversions are defined but all types (including compound types such as arrays and tuples) can be converted to strings and most can be converted from string representations of their value.

Except when initializing a variable in its declaration, conversion is never implicit in the language; all types must match in expressions.

Operations on time

Time is represented by a numeric type different from int or float. Converting a time variable to a string produces a textual representation of the time in the manner of the C function ctime. Converting a string to a time reverses the transformation while forgiving some sloppiness in the input format. When converting from a string, the seconds value may be expressed with decimal fractions of a second, to specify precision down to a microsecond. The time zone, a string with default value "PST8PDT", may be supplied as a second argument to the conversion operator when converting from time to string, or from string to time when the string does not mention the time zone. Olson identifiers are preferred. The output accounts for Daylight Saving time.

Predefined constants SECOND, MINUTE, and HOUR represent the corresponding interval. Because they tend to appear often, shorter versions are also defined: SEC, MIN, HR. Days, weeks, months and years have variant lengths due to leap seconds and leap days, so a simple constant isn't sufficient for those intervals. Instead, the functions addday, addweek etc. navigate forwards and backwards from a given time. For instance,

t = addyear(t, 1);
advances t one year. The second argument is an integer and may be negative; addyear(t, -1) backs up a year.

Another set of functions extract the components of a time, as integers. The functions secondof, minuteof, hourof(t), monthof(t) and yearof(t) extract the numeric hour, month, or year, while dayofweek(t), dayofmonth(t) and dayofyear(t) do the equivalent for the day. (Monday is day 1, while Sunday is day 7). All these are 1-origin except secondof, minuteof and hourof, which are by convention 0-indexed (midnight is 00:00:00).

For indexing purposes, it's often useful to truncate a time to some boundary to define an hour or day in which the time occurs. For an hourly interval, one could write t-(t%HOUR) to move a time back to the start of the hour, but this is clumsy and will not work for months or years. A set of functions performs this truncation: trunctosecond, trunctominute, trunctohour, trunctoday, trunctoweek, trunctomonth, and trunctoyear. The return type of these functions is still a time, so it may still be compared with other time values. For instance,

string(trunctoday(time("Sun Feb  2 23:40:08 PST 2003")))
will yield the string
Sun Feb  2 00:00:00 PST 2003
By analogy with the conversion operators, a time zone may be specified for the truncation; the default is "PST8PDT" (which may generate a time in PST or PDT depending on the time of year). For instance, if we make the time zone "America/New_York" explicit in the truncation above,
trunctoday(time("Sun Feb  2 23:40:08 PST 2003"), "America/New_York")
the resulting time is
Sun Feb  2 21:00:00 PST 2003
All the time-modifying functions accept an optional time zone for interpreting the result. The same applies to the addX and Xof functions for time as well. The general rule: the time value is converted to local time, the operation performed, and then (if necessary) the time value is converted back to the epoch. Sawzall's time zone support is based on ICU. You can use any ICU-supported time zone name but the "America/Los_Angeles" style is preferred.

Finally, the function now() returns the time at the instant it is called.

Intrinsics

There are a variety of built-in functions that operate on basic types, things such as min and max. One unusual operation is available for strings: load(file) evaluates to the contents of the named file as a single byte array. A list of basic intrinsics appears below; more detail -- with more intrinsics -- is available in another document. Also, the --explain option to the Sawzall interpreter prints useful information about the intrinsic named by its argument, or lists all the intrinsics if given an explicitly empty argument (--explain=).

In the following table, the notation [,TZ] is shorthand for an optional string-valued time zone specifier, written either in the Olson "America/Los_Angeles" style (encouraged), traditional "EST" style, the Unix "EST5EDT" style, or the offset "-500" style.
Basic builtin functions
Signature Return type Description
abs(i:int (or float)) int (or float) return absolute value
addday(t: time, i: int [,TZ]) time advance t by i (possibly negative) days
addmonth(t: time, i: int [,TZ]) time advance t by i (possibly negative) months
addweek(t: time, i: int [,TZ]) time advance t by i (possibly negative) weeks
addyear(t: time, i: int [,TZ]) time advance t by i (possibly negative) years
assert(cond: bool [, s: string]) N/A if cond is false, exit with an informative message incorporating s if present
bytesfind(lit: bytes, byt: bytes) int return the position of the first match of lit; -1 for no match.
bytesrfind(lit: bytes, byt: bytes) int return the position of the last match of lit; -1 for no match.
dayofmonth(t: time [,TZ]) int return day of month, 1-indexed
dayofweek(t: time [,TZ]) int return day of week; Monday is 1, Sunday is 7
dayofyear(t: time [,TZ]) int return day of year, January 1 is 1
fingerprintof(x: T) fingerprint return the fingerprint of x, which may be of any type
format(fmt: string, ...) string formatted print in the manner of smprintf(fmt, ...)
getenv(v: string) string the value of the environment variable named by v; undef if variable does not exist
getresourcestats() resourcestats report resource usage statistics for the program. Run szl --explain resourcestats and szl --explain getresourcestats for details.
haskey(m: map[T1] of T2, key: T1) bool return a bool indicating whether the key is present in the map
highbit(x: int) int return the bit position of the most-significant bit, starting from 1. highbit(0) is 0
hourof(t: time [,TZ]) int return hour within the day; midnight is 0
inproto(field: T) bool return a bool indicating whether field was present in the original proto buffer.
keys(m: map[T1] of T2) array of T1 return an array containing all the keys present in the map.
len(array of T) int number of elements of array, characters in string, bytes in bytes, keys in map
load(file: string) bytes return contents of named file
lookup(m: map[T1] of T2, key: T1, default_value: T2) T2 return the map element indexed by the key; if there is no such element, return the indicated default value
lowercase(s: string) string return s with upper case characters converted to lower case; handles Unicode
match(pat: string, str: string) bool does str match regular expression pat?
matchposns(pat: string, str: string) array of int return array of positions of matches to pat as pairs of indices; [0] and [1] are match of entire string; subsequent pairs describe parenthesized subexpressions; empty array is no match
matchstrs(pat: string, str: string) array of string return array of strings matched by pat; [0] is match of entire string; subsequent strings describe parenthesized subexpressions; empty array is no match
max(a: T, b: T) T maximum of a and b (T is any basic type)
min(a: T, b: T) T minimum of a and b (T is any basic type)
minuteof(t: time [,TZ]) int return minute within the hour, range is 0-59
monthof(t: time [,TZ]) int return month within the year; January is 1
nrand(n: int) int random integer in the range 0 <=x<=n-1
now() time return current time
rand() float random floating point number in the range 0.0<=x<1.0
secondof(t: time [,TZ]) int return second within the minute, range is 0-59
sort(a: array of basic_type) array of basic_type return the sorted version of an array. Only scalar values can be sorted. Values will be arranged in increasing order.
sortx(a: array of basic_type) array of int return the index vector that sorts an array. Only scalar values can be sorted. The index vector arranges array values in increasing order.
splitcsv(csv: bytes, fields: array of int) array of bytes return a list of requested fields from each comma-separated line of the specified string (represented as bytes); for example, splitcsv(B"a,b,c\naa,bb,cc", { 1, 3 }) returns { B"a", B"c", B"aa", B"cc" }
splitcsvline(csv: bytes) array of bytes splits the comma-separated string (represented as bytes); for example, splitcsvline(B"a,b,c") returns{ B"a", B"b", B"c" }
strfind(lit: string, str: string) int return the position of the first match of lit; -1 for no match.
strrfind(lit: string, str: string) int return the position of the last match of lit; -1 for no match.
strreplace(str: string, lit: string, rep: string, replace_all: bool) string return a copy of string str, with non-overlapping instances of lit replaced by rep; if replace_all is false, only the first found instance is replaced.
trunctoday(t: time [,TZ]) time truncate to start of day
trunctohour(t: time [,TZ]) time truncate to start of hour
trunctominute(t: time [,TZ]) time truncate to start of minute
trunctomonth(t: time [,TZ]) time truncate to start of month
trunctosecond(t: time [,TZ]) time truncate to start of second
trunctoyear(t: time [,TZ]) time truncate to start of year
uppercase(s: string) string return s with lower case characters converted to upper case; handles Unicode
yearof(t: time [,TZ]) int return year
Floating point math builtin functions
Signature Return type Description
ln: function(x: float) float the natural logarithm function
log10: function(x: float) float the logarithm base 10 function
exp: function(x: float) float return ex
sqrt: function(x: float) float the square root function
pow: function(x: float, y: float) float The exponential, base x, of y
sin: function(x: float) float the sine function, argument in radians
cos: function(x: float) float the cosine function, argument in radians
tan: function(x: float) float the tangent function, argument in radians
asin: function(x: float) float the arc sine function
acos: function(x: float) float the arc cosine function
atan: function(x: float) float the arc tangent function
atan2: function(x: float, y: float) float the arc tangent of y/x
cosh: function(x: float) float the hyperbolic cosine function
sinh: function(x: float) float the hyperbolic sine function
tanh: function(x: float) float the hyperbolic tangent function
acosh: function(x: float) float the hyperbolic arc cosine function
asinh: function(x: float) float the hyperbolic arc sine function
atanh: function(x: float) float the hyperbolic arc tangent function
fabs: function(x: float) float the absolute value function
ceil: function(x: float) float round x up to the nearest integer
floor: function(x: float) float round x down to the nearest integer
round: function(x: float) float round x to the nearest integer, but round halfway cases away from zero
trunc: function(x: float) float round x to the nearest integer not larger in absolute value
isnan: function(x: float) bool tests if a float value is an IEEE NaN
isinf: function(x: float) bool tests if a float value is an IEEE Inf
isfinite: function(x: float) bool tests if a float value is not +-Inf or NaN
isnormal: function(x: float) bool tests if a float value is neither zero, subnormal, Inf, nor NaN

Formatted output

There are several mechanisms in the language to produce formatted output, in the manner of C's printf() function. Although in principle one could use convert(string, value) to generate such output, the printf style allows more flexibility and convenience.

The formatted print operator in Sawzall is called format and it produces a string; it does not directly write output to a file (more on that in a moment). Its behavior is very much like an sprintf that allocates its own storage; it returns the string as a value. Here is an example:

str = format("Hello, world in the year %d\n", yearof(now()));
To write to a file, use the emit statement, which is described in detail later in this document. For writing to standard output or standard error, the predefined variables stdout and stderr can be used:
emit stdout <- format("Hello, world in the year %d", yearof(now()));
(No terminal newline is necessary since emit to stdout adds a newline automatically.)

The syntax of the format string is similar to that of ANSI C's printf - for complete documentation see szl --explain format.

Simple declarations

There are several places in a program where variables and type names must be declared. This section describes the basic syntax from which the various forms of declaration derive.

A fully qualified declaration includes the name of the object being declared, a colon, and a type specifier. The simplest type specifier is a basic type, so a declaration of a integer variable would look like this:

i: int;
An array or other indexed object is introduced using the appropriate keyword (array, sum, collection, top, or unique) and with the indexing and value types explicit. For instance,
a: table sum[string] of int;
declares a variable a to be an aggregator of integers, indexed by strings. In such a declaration, it may be helpful to give names to the indices and value, like this:
a: table sum[value: string] of count: int;
Such names are just commentary, however and are not defined for use outside the declaration statement itself (the only exception being some modifiers for output types described below). Therefore, these two versions of the declaration of a are equivalent.

A few more points about indexed types. The array type is a fairly ordinary array data type, basically like that in C. It may be indexed by integers only. The other indexed types (sum, collection, top, unique) are forms of output variables and have very different semantics, as described below. Output variables may be indexed by any basic type.

Arrays are declared without an explicit size, although a size may be set by the initializer, to be described shortly. Here are a couple of simple array declarations:

a: array of int;
aa: array of array of int;
Note that a multidimensional array is literally and explicitly an array of arrays.

Map types are declared by specifying the type of both the key and the value, like this:

zipcodemap: map[city: string] of zipcodes: array of string;
As always, the names of the key and value are optional in the declaration.

A tuple type is declared by zero or more types gathered together in braces. An example:

t: { int, float };

Tuple components are called fields and can be given names

t: { count: int, revenue: float };
to be used to access the fields (t.count, t.revenue).

For convenience, one may define and use type names. This example is analogous to the declaration of t above:

type Info = { count: int, revenue: float };
t: Info;

With the exception that arrays can only be indexed by ints, a tuple can be used as the element or index of any indexed type. Thus one may declare an output table like this:

adcount: table sum[creative: string][{ hour: time, tz: string }] of
         { count: int, revenue: float };

Tuple types can nest:

tt: { a: int, b: { c: float } };
This declaration creates a tuple with fields tt.a, tt.b and the nested field tt.b.c.

We can also declare a nested type within a named tuple:

type Data = { tag: string, type Info = { count: int, revenue: float }, i: Info };
After this declaration, one can declare variables outside the tuple using the type name Data.Info. Note that to access a nested type, we name the subfield of the type (such as Data.Info), while to access a nested value, we name the subfield of a value (such as tt.a). (The surrounding tuple type must have a name so the nested type can itself be named.)

Tuples can therefore serve as name spaces.

Type equivalence is defined as follows. Two basic types are equivalent if they are identical. Two array types are equivalent if their element types are equivalent. Two maps types are equivalent if their key types and value types are equivalent. Two function types are equivalent if their parameter types and return types are equivalent. Two tuple types are equivalent if their field types are equivalent and the corresponding fields in the two tuples are named the same.

Initializations

Variables may be initialized when they are declared:
twoplustwo: int = 5;
PI: float = 3.14159265358979;
pi_str: string = string(PI);
As a convenience, to avoid Java-like stuttering of type names, Sawzall injects an implicit conversion if the initializing expression is of a different type and can be converted. Thus the third declaration above could be equivalently written,
pi_str: string = PI;
To initialize arrays, tuples, and maps, use a list of expressions in braces:
a: array of int = { 1, 2, 3, 4 };
aa: array of array of int = { a, { 5, 6, 7, 8 }, a[0:2] };
t: { i: int, f: float, s: string } = { 3, 3.0, "three" };
m: map[string] of array of string = { "us": { "en" }, "ca": { "en", "fr" } };

There is another convenience. If the type of the initializer is the same as the type of the variable to be declared, the type can be elided from the initialization. The idiom is to use the := notation, as in these initializations:

i := 0;
f := 3.14159;
t := now();
Note that these are declarations with initialization, not assignments, despite the lexical similarity to the := operator in Pascal. This form of initialization is only available if the expression is of unambiguous type. For instance, this declaration is illegal:
a := { 1, 2, 3, 4 }
because it is unclear whether the expression is an array or a tuple.

As we explore more of the language, we'll see that initializations play a larger part in Sawzall than they do in most languages.

Input declarations

The execution of a Sawzall program takes each input record, either a protocol buffer or a line of text, matches it against a set of patterns, and performs actions according to successful matches. A program usually begins with a set of input declarations to define the format of the input records and how its components are to be interpreted during the execution of the program. These declarations break the input record into parsed components that are assigned to the variables named in the declarations. The input declarations therefore serve as a sort of parser for the input record, breaking it into pieces that are assigned to variables that can be used elsewhere in the program. The values of these variables are defined anew for each input record processed by the program.

A common starting point, if the input is in protocol buffer format, is to define a type by importing the description of the protocol buffer into a Sawzall type. There are two ways to achieve this: a proto directive, or a declaration that imports the definition from the protocol database.

The proto directive

The proto directive is a variant of an include directive and looks like this:
proto "szluser/logdata.proto"
The result of such a directive is to define types and constants describing the contents of the protocol buffer, usually as a tuple. In fact, what happens is that the directive is replaced by the output of running the protocol compiler on the named file, requesting Sawzall output.. The names of the types are found in the .proto file, so it may be necessary to examine it to see what names to use. If there are multiple protocol buffers named in the .proto file, all the types will be defined.

Values in protocol buffers

The types of the tuple fields are extracted as Sawzall types corresponding to the C++ types in the protocol buffers: int, bytes, etc. In particular, strings in protocol buffers become byte arrays when imported; it is necessary to convert them to Sawzall string variables explicitly because Sawzall strings are Unicode and the protocol buffers do not define the character encoding.

After the declaration listed above, one may declare a new variable, say logrecord, with type LogDataProto, and initialize it to the predefined byte array input in order to extract the fields of the input record:

logrecord: LogDataProto = LogDataProto(input);
After this declaration, for each input record, the fields of logrecord are set to the fields of the input record. For instance, if LogDataProto contains a field value that is a string, it is stored in the byte array logrecord.value automatically. (As with .proto specs translated into C++, field variables become lower case.)

In initializations the conversion operator can be elided since the type of the variable being declared is explicit, so this declaration could be written in the shorter form

logrecord: LogDataProto = input;
We may wish to regard the value as a regular string containing Unicode characters. The data imported from the protocol buffer is just an array of bytes, but the value field of LogDataProto is always the UTF-8 representation of the Unicode characters in the data for value in the protocol buffer. We can therefore convert the byte array logrecord.value into a Unicode Sawzall string variable by converting it explicitly in another input declaration, like this:
unicodevalue: string = string(logrecord.value, "UTF-8");
Because UTF-8 is the default input encoding and because initializations implement conversions automatically, we could write this in shorter form:
unicodevalue: string = logrecord.value;
or equivalently
unicodevalue := string(logrecord.value);
If some other field were in a different character encoding, it would be necessary to define the encoding. For instance, imagine that there was another field, logrecord.nativevalue in the native encoding defined by logrecord.nativeencoding; one could write
unicodevalue := string(logrecord.nativevalue, string(logrecord.nativeencoding));

If the input format is not defined by a .proto file, it is still possible to represent the format using a Sawzall type declaration. For instance, we could define the LogDataProto type explicitly, like this (omitting the parts not used above):

type LogDataProto = {
  value: bytes,
  nativevalue: bytes,
  nativeencoding: bytes
};
The proto directive is just an automatic way to construct such a type from the external .proto definition.

Promotion to proto

In Sawzall, a protocol buffer type is just a tuple with some internal annotations to specify how it is to be imported and exported. For the most part, these are invisible to the programmer. To see what these annotations look like, run the protocol compiler with the Sawzall plugin on a .proto file; the details are beyond the scope of this document.

Sometimes it is useful to create a new tuple type and treat it as a protocol buffer. To do this, use the keyword proto yet again:

type T = { a: int, b: float };
type protoT = proto T;
or just
type protoT = proto { a: int, b: float };
If the type is already has protocol buffer annotations, this construct adds nothing new; proto is idempotent.

This program will illustrate some of the mechanism; we leave its interpretation as an exercise for the curious reader.

type T = { a: int, b: float };
type protoT = proto T;

t: T = { 0, 0.0 };
protot: protoT = { 0, 0.0 };

emit stdout <- format("%#T", t);
emit stdout <- format("%#T", protot);

The saw operation

Suppose the value field string in a LogDataProto record could consist of several words separated by plus signs. We can use an input declaration to parse field into its constituent words. To see how, we need to explore a language operation called saw that slices a string into pieces according to a sequence of regular expressions that match successive pieces of the string.

In its simplest form, saw is a function that takes as arguments a string to slice and a series of regular expressions to use to find the pieces to slice out, and returns an array of string containing the matched substrings. After this statement,

a := saw("abcdef", "...", "..", ".");
len(a) is 3 and a[0] contains "abc", a[1] contains "de", and a[2] contains "f". Sawzall uses Perl Compatible Regular Expressions (PCRE).

If the patterns are insufficient to span the entire string, the returned array will contain only those substrings that matched. For instance, after

a := saw("abcdef", "abc", "de", "g");
a will have only two elements, "abc" and "de". As a result, not only one can use the length of the return value to determine the success of the operation, one can use saw speculatively.

The substrings matched by a saw do not necessarily abut, that is, the regular expressions search for the next match as well as extract it. For example,

a := saw("abcdef", "abc", "e", "f");
will return three substrings and skip the d. If you want the regular expressions to match abutting substrings with no gaps, `anchor' the search using the regular expression ^ operator. Compare the previous example with this one,
a := saw("abcdef", "abc", "^e", "f");
which will only return one substring, "abc", because the character after "abc" is not 'e'.

One way to think of how saw works is that, for each pattern, it searches from the current point in the string until it finds a match, slices out the substring that matches, and advances the current point.

As a more interesting example, we could break apart a string which has airport codes as three colon-separated subfields, something like "SFO:SJC:BOS". This could be sawn into pieces:

strs := saw(language, `[^:]+`, `[^:]+`, `[^:]+`);
(Note the use of the `` quote characters; it's a good idea to use them for regular expressions to avoid surprises with backslashes.) The result is a 3-element array, each element containing an element of the string, "SFO", "SJC", "BOS". We can guarantee that the string parses absolutely correctly by anchoring the searches and including the colons:
strs := saw(language, `^[^:]+`, `^:`, `^[^:]+`, `^:`, `^[^:]+$`);
if (len(strs) != 5) error...
(The regular expression syntax can be a little cryptic; here ^ means two different things, anchoring the match and complementing a character class. For the rest of this document, it is assumed you understand RE syntax; if you need a refresher course, see the appropriate man page.) In such cases, it's annoying to have the colons appear in the destination string, since they're just separators. We can tell saw to avoid copying them into the destination array, while still requiring that they parse correctly, by placing the skip keyword before the separator patterns:
strs := saw(language, `^[^:]+`, skip `^:`, `^[^:]+`, skip `^:`, `^[^:]+`);
if (len(strs) != 3) error...

As another example, imagine we had a string nums that consisted of a couple of integers separated by a space. We could pull out the integers like this:

strs: array of string = saw(str, `[0-9]+`, skip ` `, `[0-9]+`);
x: { i: int, j: int } = { int(strs[0]), int(strs[1]) };
This is a little clumsy, but initializations provide the necessary type conversions, and the saw operation will skip the blank while scanning for the second number, so we could abbreviate this process considerably:
x: { i: int, j: int } = saw(str, `[0-9]+`, `[0-9]+`);
(The next section explains what happens to x in this example if the saw returns fewer than two elements.)

Regular expressions can contain parenthesized subexpressions, and although by default the entire string matched by the expression is extracted by saw, sometimes it's useful to pull out the subexpressions instead. To do this, prefix the pattern with the word submatch, as in this example to parse a line of a CSV file consisting of two fields per line, the first of which may contain embedded commas in a quoted string:

saw(s, submatch `(.*),`, `.*`)
In this example, the idea is that the .* pattern in the first expression will be greedy and absorb all commas, but the final comma in the pattern forces it to stop there; using the parenthesized subexpression allows us to extract the string up to but not including the final comma on the line. (But see the splitcsvline builtin to handle nested quotes.)

To clarify parsing of basic types, the intrinsic operator regex produces a regular expression that matches a type. Thus we could match an integer and a float like this:

x: { i: int, x: float } = saw(str, regex(int), regex(float));
With array of string as the destination type, of course, anything matches, so it is almost always necessary to provide explicit separators or unambiguous patterns when cutting strings into strings, as in the language example above. When the destination type is int, an optional numeric base may be supplied, as in regex(int, 16); the default is 0, which uses the ANSI C rules for recognizing integer constants.

The saw operation consumes only as much of the input string as is required to match the pattern. There are two looping variants of saw. The first, sawn, takes a count that specifies how many times to run through the parameter list. We could use it for the language example like this:

sawn(3, language, `[^:]+`, skip ":");
On the last run of the loop, a final skip parameter needn't be matched, so sawn can be applied to text, like the language field, that is split by separators rather than terminators.

The second variant, sawzall, loops through the patterns until it runs out of input. For instance, to break up a line of integers, we could use

sawzall(line, regex(int))
and to split tab-separated fields
sawzall(line, "[^\t]+")
(Note: this one needs double quotes.)

[Empty result strings cause some complications. You can skip this paragraph if you avoid regular expressions that generate empty strings; as the next paragraph says, that's almost always possible. Some comments about empty result strings: First, when a saw makes no progress advancing across the string, it returns an empty array. For instance,

x: array of string = saw("XYZ", `Z*`, `Z*`, `Z*`);
will return an empty array, even though all the patterns match, because they all match at the beginning of the string. This property is a safety check against looping on bad data and it provides a termination condition to sawzall when it reaches the end of the input. For similar reasons, sawn (with a count greater than 1) and sawzall treat empty result strings specially. If an empty string is matched at the same location as the end of the previous match, it is discarded, the position in the input string is advanced one character, and scanning resumes. A couple of examples:
x: array of string = sawzall("XYZ", `Z*`);
will return three matches (empty at 0, empty at 1, "Z"); as will
x: array of string = saw("XYZ", `Z*`, `Z*`, `Z`);
(but they are: empty at 0, empty at 0 (not 1), "Z").]

Your best bet is to avoid empty matches when possible, which it usually is. Use the + closure (one or more) rather than the * closure (zero or more) when appropriate, or make sure your arguments contain at least some literal data. For example, to match an integer, write "[0-9]+" or "\\d+" or `\d+` (using the Perl digit operator; in double quoted strings it's necessary to protect the backslash) or regex(int), not `[0-9]*`, which can match the empty string and confuse the parsing of erroneous input. To break a string into lines, use sawzall(str, ".*\n") or sawzall(str, `.*`, skip "\n"), not sawzall(str, `.*`). For fields separated by spaces, use "[\t ]+" or `\s+` (the Perl white space operator) rather than "[\t ]*".

Sometimes it's useful to grab the remainder of the string after splitting off some piece. To do this, add as a `pattern' the name of a pre-declared variable preceded by the keyword, rest. For instance, to extract three integers from the front of str and advance str, we could write

sawn(3, str, regex(int), rest str);
A rest may appear anywhere in a saw operation but may only be the final pattern in a sawn or sawzall; for these looping operators, the string is saved after the rest of the processing is done.

It's worth being precise about the way sawn and sawzall work, and now that we have the rest notion we can express their behavior more carefully. The sawn operation works schematically as follows. Imagine the operation is written sawn(n, str, X, rest r), where X represents the patterns of the expression. This is equivalent to saw(str, XXXX(n times), rest r), with the proviso that a final skip parameter may be ignored in the rewrite. Because sawzall does not have a fixed count, it cannot be implemented this way. Instead, it works iteratively as long as it makes progress, something like this:

tempstr = str;
while (making progress)
  append to array the result of saw(tempstr, X, rest tempstr);
r = tempstr;

Returning to our airport code example, we can use saw to do the work. To make things clear, let's define a type:

type Codes = array of string;
and use it in conjunction with an invocation of sawzall:
airports: Codes = sawzall(airportcodes_field, "[^+]+");

Failures and default values

Saw operations may not find a match. Numerical errors may cause zero division or overflow. Array indices may fall out-of-bounds. Map keys may be not present in the map. These and other problems will result in conditions for which the value of variables may be unspecified. The Sawzall language defines the behavior of the system in such cases.

The approach is for the system to track, for each input record, which variables have well-defined values. This definedness property propagates through the various input declarations, expressions, and other statements as they execute. Whenever a declaration or pattern/action group depends on undefined values, it is ignored, with the exception of variable declarations, function return values, and arguments to the built-in function def() (see below). If an undefined value is used to initialize a variable in its declaration, the variable is marked as undefined (i.e., the assignment is ignored). If the return value of a function is undefined, the result of the function call is undefined. In effect, the property of ignoring undefined values means that for each input record, the Sawzall program is trimmed to the subset of its execution that can be defined, given the input.

The granularity of the trimming is that, within the block that evaluates an undefined variable or expression, all subsequent top-level statements whose execution could depend on that variable are elided.

Although intended primarily to make processing of large data sets robust in the face of ill-formed data, this property can be exploited for the programmer's convenience. One could write a variety of speculative input declarations, only some of which would ever be valid for a given input record, without any need for error checking or bookkeeping. The actions they trigger will only be executed if the speculation was successful.

Since this behavior can mask bugs, there is a global flag (ignore_undefs) to the interpreter to enable it. When not set, undefined values trigger a run-time error.

For occasions when one wants more fine-grained control, there is a boolean primitive def(expression) to check whether an expression has well-defined value. This primitive `protects' the inner expression from needing to be valid for its surrounding block to be executed. As a simple example, given a map m, the if statement

if (def(m[key])) ...
tests whether key is present in the map (although it's faster to use the haskey or lookup builtin for this). Functions often return undefined values to indicate errors. As an example, getenv() returns undefined if the named shell variable does not exist in the environment. One may therefore test whether a variable exists by writing
if (def(getenv("VARIABLE"))) ...

Initializers and static values

So far, most of the examples we've looked at derive from the input line. It is often useful, however, to define variables whose values do not depend on the input. Syntactically, it's just the same:
message: string = "Hello, world\n";
or
message := "Hello, world\n";
The values of such initialized data will be evaluated, just like regular input variables, once for each record. For true constants, define the data to be statically determined:
static kPowersOf2: array of int = { 1, 2, 4, 8 };
The static keyword states that the variable will be created and evaluated once at the start of execution; not once for each record. Of course, the data involved in a static definition must also be statically determined or an error will result.

We can construct static parses of data, as in this silly example:

static kPowersOf2: array of int = sawzall("1 2 4 8", regex(int));
This mechanism provides a way to initialize a table from a free-format table of input. This example initializes a table from a list of names, one per line:
static kDivas := sawzall(load("/home/szlusers/divas"), `.+`);
(The period regular expression metacharacter does not match a newline.)

Static variables may be initialized only by expressions, not for instance by statements such as for loops, but those expressions can represent arbitrarily complex calculations because they may include function calls. Functions used in static initializers must themselves be declared as static, as explained below.

It is legal to create a static variable as a field of a named tuple type:

type Buffer = { data: string, static kSize := 1024 };
Such a variable is considered part of the type, so outside the tuple it would be named Buffer.SIZE. It is not part of a value of that type, so if we had a variable buf of type Buffer, we could write Buffer.SIZE but not buf.SIZE.

Output declarations

Output variables are variables of an output type. They aggregate the data produced by the execution of the program. Output types are introduced with the table keyword followed by a name identifying the kind of output type desired. Sawzall is extensible with respect to output types. Each output type aggregates, according to its individual definition, the data that is output to that table. Sawzall currently supports:
  1. collection: A simple collection or concatenation of the data.
  2. sample: A statistical sampling of N items.
  3. sum: An arithmetic sum of the data.
  4. top: Statistical samplings that record the `top N' data items.
  5. maximum: A precise sample of the N highest-weighted data items.
  6. minimum: A precise sample of the N lowest-weighted data items.
  7. unique: Statistical estimators for the total number of unique data items.
  8. set: A set of size at most N. Larger sets are discarded.
  9. quantile: Approximate quantiles (actually N-tiles) for data items from an ordered domain.
  10. distinctsample: A uniform sample of a given size from a set of all values seen.
  11. inversehistogram: An approximate histogram of unique values.
  12. weightedsample: A sample of a given size from a set of all values seen, biased towards values with higher weights.
  13. recordio: An unindexed collection written directly to a binary file.
  14. text: An unindexed collection written directly to a plain file.
  15. mrcounter: An integer counter that can be used to provide an accumulated count (e.g. a progress indicator) to a C++ program invoking the Sawzall interpreter.

The basic idea behind all the output types is that they gather and process data and store the result in pigeonholes indexed by the indices of each type. If an output variable has no indices, it represents a singleton. Unlike regular arrays, indexed output variables must have the type of the index defined explicitly in the declaration (it's always int for arrays). For example,

numdivasfans: table sum[string] of int;
is an output variable indexed by strings. Each entry stores an integer. Each time an index is mentioned, the cell to store its value will be created if it does not exist, much like an Awk associative array or Python dictionary. Subsequently (by mechanisms outside the language; see the mapreduce_demo_unittest example program) the data for, say, index "britney" can be recovered again by accessing numdivafans["britney"]. On the other hand,
numbritney: table sum of int;
represents a single value, accessed by numbritney with no index.

Here are some suggestive examples. The declaration syntax is straightforward, except that some tables take parameters and some take a special weight attribute. These are described in the sections that follow.

# Collect all the divas into a single stream.
alldivas: table collection of string;

# Count the number of times each diva is seen.
numdivas: table sum[diva: string] of count: int;

# Record the top 10 divas (by fan count) for each state
topdivas: table top(10)[state: string] of diva: string weight count: int;

# Record the ten best selling divas
bigmoney: table maximum(10) of diva: string weight sales: float;

# Count how many unique lyric phrases using an (internal) sampled table of
# 10000 entries to estimate the distribution:
uniquephrases: table unique(10000) of phrase: string;

# Gather song titles for those divas for which there are not more than
# 10 distinct titles
session: table set(10)[diva: string] of title: string;

# For every diva, gather approximate quantiles for the ages of the fans
fanagequantiles: table quantile(10)[diva: string] of fan_age: int;

In the current implementation, the elements of output variables may be tuples but may not be arrays or contain arrays. For instance, the following is not supported:

s: table sum[int] of array of float;  # unsupported
The next sections describe the output types in more detail, but first we need to see how they are used.

Actions 1: The emit statement

The only thing one may do to an output variable is to use it in an emit statement in an action. The syntax of an emit statement is simple: the keyword emit followed by a fully-indexed output variable reference, followed by an arrow and the complete value to record there, like this:
emit numdivafans["britney"] <- 1;
This statement says to aggregate the integer 1 to the table numdivasfans indexed by the string "britney". Since numdivafans is a sum, the effect is to add 1 to the count of "britney" fans. Other values may be worth aggregating in this fashion, for example:
sales: table sum[int] of float;
...
emit sales[diva] <- 0.25;
or even tuples:
fileoffans: table collection[decade: int] of { string, string };
...
emit fileoffans[fan_age / 10] <- { diva, song };
Now let's examine the various output types in more detail.

Collection

A collection variable aggregates each data item under its index into a continuous stream of data. The result is a historical record of the values emitted under each index, but without any order guarantees.

Sum

A sum variable accumulates the arithmetic sum of each of its components, elementwise. The elements may be only int or float or tuples created from them. This declaration defines an hourly count of number of albums and sales, indexed by diva and seller:
salesperhour: table sum[diva: string][seller: string][hour: time]
  of { count: int, sales: float };

Top

A top gathers an approximate list of the most popular items. The size of the list must be defined in the declaration, along with the 'weight' used to decide which entries are the `top'. For example,
top100songs: table top(100) of song: string weight int;
This example estimates the most popular song titles based on the number of times each distinct title is emitted to the object, assuming the corresponding emit statement uses a weight of one:
emit top100songtitles <- title weight 1;

We may weight by something that is not a constant, such as the number of times a song is played as logged in a given record:

top100songs: table top(100) of song_id: int weight playcount: int;

Note that top guesses the most popular items and estimates their weights. To compute the exact answer, run two programs, the first computing the totals using a sum, and the second selecting the items using a maximum. For example, one can find the 100 most popular songs exactly. First run program 1:

playcounts: table sum[song: string] of count: int;
...
emit playcounts[song] <- 1;
then apply program 2 to the output of program 1:
top100songs: table max(100) of song: string weight int;
# song and count retrieved from playcounts in program 1.
emit top100songs <- song weight count;
The result will be the precisely tabulated top 100, but requires producing a large intermediate table of exact counts.

The quality of the estimates generated by top depends on the distribution of items in the input, and is hard to predict in advance. It was designed to work well for zipfian distributions, but will perform poorly for flat distributions. In addition, merging multiple runs of a top table may give slightly inferior results to running on the entire data set. The output for a top table includes the item values and weights, along with an estimated deviation of the weights. If the deviation is significantly smaller than the weights, then the answers are probably reasonably accurate. However, if the deviation is large, the answers may be wildly off.

The implementation of top is limited to 1000 items.

Maximum and Minimum

A maximum records the values that have the highest weight. It is syntactically similar to a top, but unlike a top table, maximum is precise, not statistical, and makes its selection based on individual values and their weights, rather than by summed counts over a set of related values. For instance,
t: table top(10) of song: string weight int;
will estimate which are the 10 most popular songs based on how many times they appear, but
m: table maximum(10) of song: string weight int;
will look at the list of songs, weighted by their integer weight, and store the ten song names whose individual weight (count) was the largest. The same song emitted multiple times to a top will climb in the rankings, but the only way a song in a maximum table will get to the top is if some single emit statement with that value has a very large weight.

A minimum table is equivalent to a maximum table with inverted weights.

Unique

A unique uses a statistical algorithm to estimate how many unique items occur. They are evaluated by examining the density of a sampling table whose size must be provided as a parameter.
uniquephrases: table unique(1000000) of phrase: string;
The value stored in a unique is just an integer representing the estimate of the number. If the value is a tuple, the entire tuple must be unique to count as a separate value:
uniquedivaphrase: table unique(10000) of { diva: string, phrase: string };

Set

A set records up to N distinct values for each index. If an index would have more than N elements, it and its values are discarded, so sets of more than N elements are ignored. This design avoids deciding which elements to keep if the set is large. For a statistical sample of N elements, a sample table or top table would be more appropriate.
songs: table set(120)[diva: string] of { song: string, length: int };
accumulates the songs for each diva, but only if there are no more than 120.

Quantile

Quantiles provide a way to represent the underlying data distribution in a succinct manner. If there are M data elements (emits to an index in the table) their N-tiles (N>1) are N elements with ranks 1, M/(N-1), 2*M/(N-1), ..., (N-3)*M/(N-1), (N-2)*M/(N-1), M in the sorted list of all the elements. However, the quantile table in Szl computes an approximate version of these N-tiles. Let X1, X2, ... XN be the elements output by the quantile table. The first and last elements ( X1, XN ) are accurate and represent elements at rank 1 (min) and M (max) respectively, where M , as before, is the total number of emits to the index. However, all other elements may have ranks (in the sorted list) off by at most M/(N-1) relative to their desired rank. More precisely, the Ith element (1 < I < N ) XI has rank I*M/(N-1) +/- M/(N-1) when ideally it should have a rank I*M/(N-1) . The output of the quantile table is guaranteed to be in sorted order. For example,
fanagequantile: table quantile(5)[diva: string] of age: int;
computes approximate quantiles of the number of fans logged for each divas.

DistinctSample

The table distinctsample takes a uniform sample of a given size from the set of all values seen. Conceptually that means first removing duplicate copies of all values that occur more than once in the input, and then taking a sample without replacement from the resulting duplicate-free data set. For example,
my_sample: table distinctsample(100) of phrase: string weight int; emit
my_sample <- logrecord.phrase weight 1;
picks a sample of 100 song phrases, each phrase being chosen with equal probability, regardless of its multiplicity. In addition, the distinctsample aggregator keeps track of the number of times each sampled phrase appears in the data set.

InverseHistogram

The inversehistogram aggregator produces an inverse histogram, that is, a table saying how many values occur with aggregate weight 1, 2, 3, ... and so on. The inverse histogram is approximate, with accuracy specified by a parameter. The numerical parameter is how many samples to use to generate the histogram. For example,
my_histogram: table inversehistogram(50000) of phrases: string weight count: int;

if (some condition) {
  my_histogram <- logrecord.phrase weight 1;
}
might produce output that looks like this:
my_histogram[] = 0, 257610.69575674
my_histogram[] = 1, 0.86072
my_histogram[] = 2, 0.11002
my_histogram[] = 3, 0.01918
my_histogram[] = 4, 0.0052
my_histogram[] = 5, 0.00228
my_histogram[] = 6, 0.00092
my_histogram[] = 7, 0.00054
my_histogram[] = 8, 0.00038
my_histogram[] = 9, 0.00018
my_histogram[] = 10, 0.00012
...
The 0-th entry is an estimate of the number of unique elements it the data set.

For i = 1, 2, ..., the result is the fraction of unique items that appeared exactly i times. That is, 86% data items occur only once, 11% of items occur twice, 1.2% of items occur 3 times etc.

The frequencies should add up to (approx.) 1.0, give or take something for approximation.

WeightedSample

The table weightedsample takes a sample of a given size from the set of all values seen. The sample is biased towards the values with higher weights. For example,
x: table weightedsample(2) of x_i: string weight w_i: float;
emit x <- "a" weight -1.0;  # definitely not chosen
emit x <- "b" weight 0.0;  # definitely not chosen
emit x <- "c" weight 1.0;  # unlikely to be chosen
emit x <- "d" weight 100.0;  # likely to be chosen
emit x <- "e" weight inf;  # definitely chosen, unless more than 2 (table parameter) inputs have weight = inf
picks a sample of 2 strings. The input "d" is more likely to be chosen than "c" because its weight is much higher. The probability to be chosen is generally not proportional to the weight. When all weights are the same, the table is a rigorously uniform random sample, regardless of how the inputs are sharded. The table does not remove duplicated inputs or sum the weights for each distinct input. If you emit "x" twice, with weights 1 and 2, respectively, both of them may get into the table, so it is different from emitting a single "x" with weight 3.

Direct output tables

The output of an emit statement is usually saved by the invoking programfor aggregation and statistics gathering. Sometimes, though, the results are more usefully saved in a file for later analysis by other tools. To this end, Sawzall supports some tables that write results directly to a file in formats easily read by other tools. Integral types are written as 64-bit big-endian ints, floats as little-endian 64-bit IEEE floats, and bytes and strings are unencoded.

Recordio

A recordio is a direct output table that writes its emitted elements directly to a binary file. Each record begins with the length of the data (not counting the length part itself) encoded in varint form. There cannot be an index for a recordio table. The element may be a bool, int, fingerprint, time, bytes, string or float. The parameter gives the compression block size for the file in bytes; zero implies uncompressed output. (Not implemented; all recordio files are uncompressed.) This declaration defines an uncompressed recordio output for fingerprints:
prints: table recordio(0) of fingerprint;

Text

A text is a direct output table that writes its emitted elements directly to a file as text. There cannot be an index for a text table, and the element type must be a string or a bytes. Output is not modified by the table before writing to the file; in particular, newlines are not added. This declaration defines a text output for songs:
songs: table text of string;

Mrcounter

A mrcounter is counter that is only maintained internally instead of being written to a user-specified destination. This table is usually used for monitoring and debugging. It cannot be indexed and accepts only integer elements. For example:
input_count: table mrcounter of int;
emit input_count <- int(def(input));
A C++ program using the szl library to execute a Sawzall program can access the counter value; a typical use would be to update progress information.

Formats

By default, the output generated by an emit statement is transmitted in an internal encoding for use by the underlying data processor. For output to plain files, or to organize data into more tractable form, any output variable (except sum or mrcounter that expect addable elements) may be declared with a format expression. This format is attached at the end of the declaration and can be applied to table elements. Here is a simple example:
allsongs: table collection of s: string format("%s\n", s);
Note how the element is named (s) in order to identify it in the format expression. A more complex example might look like this:

fileoffans: table collection[decade: int] of F: { diva: string, song: string }
  format("%s\t%s\n", F.diva, F.song);

The elements stored in an output variable with a format expression are always strings (generated by the output of the format), not the type in the declaration, while the variables in the format directive are defined by the declaration.

Files

In addition to direct output tables like recordio, the language has a mechanism to decorate the declaration of collection output variables with the name of a file into which results will be written. The syntax is similar to the format syntax, but in this case the relevant input is the indices rather than elements. Note that when using file attributes, table values must be of type bytes or decorated with format. For example, the declaration
songsbydiva: table collection[diva: string] of song: string
  file("/tmp/song.%s", diva) format("%s\n", song);
uses the diva to distinguish file names.

In this example the file name is a constant.

songs: table collection of song: string
  file("/tmp/song") format("%s\n", song);

Procs

A proc is analogous to a file but instead of emitting data to a file, it sends data to a dynamically created external (Unix) process. Note that when using proc attribute, table values must be of type bytes or decorated with format.
echobydiva: table collection[diva: string] of song: string
  proc("echo %s", diva) format("%s\n");
The proc clause specifies the arguments to a new process created for each unique index value.

Patterns

Sawzall defines a special statement, called a when statement, that expresses a pattern to match in the input record and an associated action. Its basic syntax is
when (pattern) action
A when statement is much like an if statement except that its pattern language is more powerful and, potentially, more efficient. This section and the following describe the when pattern language.

In its simplest form, a pattern is a single boolean expression composed from operators such as string and numerical comparison and is usually applied to input variables and constants.

Since patterns are just boolean expressions, they can contain all the available operations for manipulating strings, numerical, values, etc. Also, there are helpful intrinsic functions such as regular expression matching to aid in constructing patterns.

Here is a simple example that uses the match intrinsic function, which returns a boolean recording whether the second argument, a string, matches the regular expression defined by the first argument (the arguments are in the same order as in grep):

match("^Britney", diva) && !match("^Why ", song)
or
sales > 1000000 || match("girl", song)

A note about the logical operators. The various patterns in a Sawzall program may be executed in parallel. Moreover, where possible, the components within each pattern expression may also be evaluated in parallel in some circumstances. However, the C && and || operators define an order of evaluation that precludes such parallel processing. Therefore, Sawzall also includes regular keywords (and, or) for logical operators that do not define evaluation order; judicious use of these operators may result in more efficient execution. Beware, though, that it is not always safe to replace && by and.

Quantifiers

To discover certain patterns in data, it is useful to ask questions of the form: Do all these data items satisfy some condition? Or: Which element of an array satisfies some condition? These questions are analogous to the "for all" and "there exists" quantifiers in logic. For this purpose, the language introduces special pattern variables called quantifiers. Each quantifier must be declared within a pattern and is local to that pattern and its associated action.

Quantifiers come in several forms. The simplest is called a some variable and works as follows. The pattern

x: some int;
lyricword[x] == "foo"
succeeds if, for some value of x, the expression lyricword[x] == "foo" is true. Moreover, it binds the value of x to that index, so it can be used in the action associated with the pattern. Thus this pattern can be read, "for some x, the lyricword array indexed by x is the string foo".

If the pattern can be satisfied by several values of the some variable x, the action will be executed only once, for some arbitrarily chosen value. (Hence the name, some.) It can also be useful to have an action execute once for each possible value; for this purpose, use a different quantifier called an each variable. For example, the pattern

y: each int;
lyricword[y] == "foo"
succeeds and causes execution of its associated action for each value of y that satisfies the condition, that is, for each element of lyricword equal to "foo".

When multiple quantifiers occur in a pattern, the binding of quantifiers to values follows the rule that the variables are bound in the order in which they are declared. For instance, the pattern

x: some int;
y: each int;
lyricword[x] == foo[y]
will succeed if there is some pair of values for x and y that satisfy the condition. If so, x is fixed to some successful value and y is bound, sequentially, to each possible value for which the condition is satisfied for that fixed value of x. If multiple values of x would satisfy the condition, only one is chosen, arbitrarily, subject to the constraint that y also can be bound to at least one value.

To aid in readability, it helps to prefix quantifier variable names with the appropriate term. By this convention, the previous example would be written:

somex: some int;
eachy: each int;
lyricword[somex] == foo[eachy]

The final type of quantifier is an all variable; it requires that the expression succeed for all possible values of the variable. For instance,

allx: all int;
lyricword[allx] == "foo"
succeeds if every element of lyricword has the value "foo". Unlike some and each variables, all variables are not bound to a specific value for the execution of the pattern; their value is implicit in the evaluation of the pattern only.

Quantifiers are defined so that they must be assignable a value, so if in the above examples the arrays had no elements, all the patterns would fail. This condition arises because if the array were empty, we could not bind some and each variables and thus could not execute the action. This constraint does not apply to all variables;

lyricword[allx] != "foo"
will succeed if lyricword is empty. To require a non-empty array, be explicit:
len(lyricword) > 0 and lyricword[allx] != "foo"

For a pattern to succeed, it must be possible to assign a consistent set of values to all the quantifiers that appear within it, subject to usual logical rules. Moreover, the value of a quantifier must be the same throughout the pattern. For instance,

allx: all int;
foo[allx] == bar[allx]
succeeds iff foo and bar have the same (even zero) number of elements and if the corresponding elements have the same value. By contrast,
allx, ally: all int;
foo[allx] == bar[ally]
does not require that the arrays have the same dimension but does require that all the elements of both arrays have the same unique value, since allx and ally will range over their indices independently.

For now, quantifier expressions must contain an indexed array or map to define the range of values for the quantifier variable.

Here are a few illustrative examples. Give the quantifiers

somex, somey: some int;
allz: all int;
the expression
foo[somex] == bar[somex]
finds somex such that the corresponding elements of foo and bar are equal. The expression
foo[somex] == foo[somey] && somex != somey
finds whether there are two different equal elements of foo. Finally, the expression
baz[somex].elem[allz] == "hello"
finds a value of somex such that all the elements of the elem subarray of baz have value "hello".

In general, expressions like the last one could be ambiguous: does it check whether there is a particular value of somex for which all the subelements match, or whether for all subelements there is some, possibly different, value of somex? (In logic, it's a difference of binding order: for all z there exists an x such that ... versus there exists an x such that for all z... .) In this language, however, we require the pattern to bind quantifiers uniquely and in order in a given evaluation of the pattern, so the interpretation is unambiguous.

Why quantifiers?

The existence of quantifiers in the language may seem an unusual decision, but they serve a purpose. Consider the pattern and action
when (
  somex: some int;
  foo[somex] == bar[somex]
) {
  emit ...
}
This sequence is semantically equivalent to something like
{ len = min(foo.len, bar.len);
  for (somex = 0; somex < len && foo[somex] != bar[somex]; somex++) ;
  # somex >= len || foo[somex] == bar[somex]
  if (somex < len) {
    # foo[somex] == bar[somex]
    emit ...
  }
}
Obviously, quantifiers provide some notational parsimony. But the more important difference between these examples is that it is much easier to analyze the pattern in the first example for purposes of optimization and, particularly, parallelization. It would require considerable sophistication for the Sawzall compiler to extract the meaning of the data selection being made in the for loop, and even more sophistication to combine its evaluation with other statements in the program.

In brief, by making the looping over data implicit rather than explicit, it is much easier to optimize the evaluation of the pattern language by automatic compile-time and run-time examination. Also, by restricting the pattern language, we hope to expose more opportunities for optimization.

Actions 2: The action language

The rest of the action language is straightforward and doesn't deserve much elaboration here.

The expression language is much like that of C, with the same operators but no ?: expression (but see the next section). Assignment is a statement, not an expression. Also, increment and decrement operators ++ and -- are postfix only and their use is regarded as an assignment and therefore a statement, not an expression.

Other than the addition of a when statement, which is like an if with quantifiers, and the ability to declare variables anywhere in a block, the control structures are much like C's:

As with any other control structure, when statements can be nested, so one may use one high-level pattern to select lines, then bring other aspects of the language to bear on a subset of the input, perhaps like this:

when (some rare condition) {
  when (some related condition) {
    for (i = 0; i < N; i++)
      emit output <- f(i);
  }
}

Statement expressions

Sawzall does not have C's ?: expression but it does have a more general construct called a statement expression.

A statement expression looks like a regular block of statements with a question mark ? preceding it. The statements in the block are executed, and the value of the statement expression is the value of the first result statement executed within. This example shows how to write the equivalent of a ?: expression:

a := ?{ if (x > y) result x; else result y; };
The values of all the result statements in the block must have the same type.

Any statements can appear inside the block. return statements cannot appear in the block without an enclosing function.

# good
sign := ?{ switch (true) { case x < 0: result -1; case x > 0: result 1; default: result 0; } };

# bad
x := function(): int {
  n := ?{
    return 2;  # compile-time error
  };
};
Here is a more complex example that puts a statement expression in the condition of a when statement, including a declaration and debugging print:
when(x: each int; ?{
		static kWeekend: array of string = { "Saturday", "Sunday" };
		emit stdout <- format("try %s", string(t));
		result t.day[x] == kWeekend[x];
	}
) {
	WeekendWarrior(t);
}
When the result is encountered, execution of the rest of the block is skipped, analogous to returning from a function. Similarly, it is an error if execution reaches the end of the block without executing a result statement.

Memory

The memory model is generally straightforward but there are a few differences when compared to that of most languages. They derive from the unusual execution model, which is designed for efficient parallel processing of huge amounts of record-oriented data.

Array allocation

Arrays in Sawzall are often created by an initializer or expression that defines their size and contents, but sometimes one wants to create an array of a given size for later use. In C, one would use malloc for this; in Sawzall, there is an array creation operator new. It takes three arguments: the type of the array, the number of elements, and an initializer expression that is used to set the value of all elements of the array upon creation. The syntax is straightforward. Here's an example that creates an array of ten integers, each set to zero:
a: array of int = new(array of int, 10, 0);
To create multidimensional arrays, cascade the new operators:
aa: array of array of string =
  new(array of array of string, 3, new(array of string, 3, ""));

One must always provide an initializer expression to guarantee all array elements are defined and thereby protect the execution of the program from undefined values.

On the other hand, if one wants an empty array, one can write the literal expression {} rather than new(array of int, 0, 0).

Map allocation

Unlike arrays, maps are of dynamic size; they grow as elements are added. (At present, there is no way to delete a key from a map, although one can change the value associated with a key.) The new operator can also be used to create a map, except that the resulting map is always empty and therefore one need not specify the size or initializer:
m: map[int] of int = new(map[int] of int);
If one knows the final size of the map, though, it may be more efficient to provide an explicit size; this trick avoids reallocation as the map grows:
squares := new(map[int] of int, 1000);
for (i := 0; i < len(squares); i++)
  squares[i] = i * i;
Usually, it is sufficient and easier just to use a map literal, as in:
m: map[int] of int = {:};
Note that, as with arrays, one cannot store entries in a map until it has been created; an empty map is not the same as an uninitialized one.

Functions

Sawzall permits the programmer to define functions. This example, a recursive Fibonacci function, demonstrates the syntax:
Fibo := function(n: int): int {
  if (n > 2)
    return Fibo(n-1) + Fibo(n-2);
  else
    return n;
};
Note that the syntax is analogous to that of initializing a variable, with the name, the type (usually derived from the initializer), and the initializer. (For compatibility with old code, the equals sign can be omitted but that style is deprecated.) The function type itself contains the keyword function, a parenthesized parameter list, another colon, the return type, while the initializer is the function body in braces. Note the semicolon after the final brace.

The return statement returns the function value. If the function has no return value (void in C terminology), just leave out the return type in the declaration and the value in the return statement as illustrated in the following example:

Prime := function(x: int): bool {
  # returns true if x is a prime number, returns false otherwise
};

PrintPrime := function(beg: int, end: int) {
  for (x: int = beg; x < end; x++)
    if (Prime(x)) {
      emit stdout <- format("%d is prime", x);
      return;
    }
};
The function printprime prints a prime number in the interval [beg, end), if there is any, and does nothing otherwise. The final return statement is omitted altogether since the function will return automatically.

Functions may be declared anywhere, not just at the top level of the program. Nested functions may access local variables in the surrounding scopes as well. For instance, the previous example could have been written with nesting:

PrintPrime := function(beg: int, end: int) {
  x: int = beg;

  Prime := function(): bool {
    # returns true if x is a prime number, returns false otherwise
  };

  while (x < end) {
    if (Prime()) {
      emit stdout <- format("%d is prime", x);
      return;
    }
    x++;
  }
};
In this case, the nested function prime accesses the variable x defined in the surrounding scope. Note that functions are stored in variables like any other value: The prime "function" is just a variable named prime with the body of a function as its value. Thus, functions (or more precisely: function values) may be assigned to variables like any other value. However, in contrast to ordinary variable values, functions must not escape the surrounding scope: A function itself may be passed as parameter to other functions, but it cannot be assigned to variables that survive the lifetime of the surrounding scope (because the function might access variables defined in the surrounding scope when the corresponding surrounding function invocation doesn't exist anymore).

Functions called in initialization expressions for static variables must be static functions, that is, they must be declared to be static, as in this initialization code:

static MakeMap := function(n: int): map[key: int] of value: string {
  m: map[int] of string = {:};
  for (i := 0; i < n; i++)
    m[i] = string(i);
  return m;
};

static num_map := MakeMap(100);
This facility allows one to execute complex calculations statically, that is, once before all input processing begins. However, a function can be declared static only if it depends only on static data such as constants and other static variables. In other words, a static function must be computable at static execution time. (It is fine to invoke a static function with non-static data as arguments, but not in order to initialize a static variable.)

Most builtin functions are static.

Programs

A program consists of its input definition, its output definition, and then one or more pattern/action sets, in the following syntax:
input declarations
output declarations 

when (pattern) {
  action statements
}

when ...

This at least is the conceptual model for programming in the Sawzall language. In general, the format is more flexible than this, since a program is really just a sequence of statements, when clauses can be nested, and so on. But for the purpose of organization, it is good style to express as much of the pattern matching/filtering as possible in the top level, as this schematic program illustrates.

Each input line is parsed according to the input rules and then passed to the program, which executes from top to bottom. The implementation may choose to execute when statements in parallel.

Some of the language's unusual features are there to make this parallelism feasible. Each pattern/action set may run independently; there is no state (variables, etc.) shared between them other than the input variables, which are read-only, and the output variables, which are write-only. The order of execution of emit statements in different action blocks is undefined; the program behaves as if all the pattern/action blocks execute in parallel and finish execution in arbitrary order. This is exactly what will happen in the off-line case when the execution is distributed over many machines. For similar reasons, the language does not guarantee that the same emit statement executing on different records will generate results in input order.

Examples

This first example is a filter that just selects output lines and emits them in their original (protocol buffer) form if they originate on a particular machine:
# Input is custom logs in protocol buffer format
log_record: proto LogRecordProto = input;

# singleton output stream
output: table collection of bytes;

# Select the relevant records
when (log_record.diva == "Britney")
  emit output <- input;
Here is the Sawzall equivalent of the Unix wc.
nlines: table sum of int;  # Lines
nwords: table sum of int;  # White-space separated tokens
nchars: table sum of int;  # Unicode characters
nbytes: table sum of int;  # Raw bytes

emit nlines <- 1;
emit nwords <- len(sawzall(string(input), "[^ \t]+"));
emit nchars <- len(string(input));
emit nbytes <- len(input);
The next example is part of a regression test for the regex operator. It demonstrates the use of an initialized array.
ftab: array of { s: string, f: float, ok: bool } = {
  { "1", 1.0, true },
  { "-1", -1.0, true },
  { "+1", 1.0, true },
  { "1.0", 1.0, true },
  { ".02E+3", .02E+3, true },  # and so on

  { "-", 1.0, false },
  { "+", 1.0, false },
  { " abc ", 1.0, false }  # and so on
};

for (i := 0; i < len(ftab); i++){
  # anchor the regex result to force match of whole string
  v: array of float = saw(ftab[i].s, "^"+regex(float)+"$");
  if ((len(v) == 1) != ftab[i].ok)
    emit stdout <-
      format("float[%d] FAIL (validity) for %s\n", i, ftab[i].s);
  else if (ftab[i].ok && v[0] != ftab[i].f)
    emit stdout <-
      format("float[%d] FAIL (value) for %s: %g\n", i, ftab[i].s, ftab[i].f);
}
Here is a more typical example that counts references about pop stars.
proto "LogRecordProto.proto"

# Words of the news article are separated by plus signs
type WordType = array of string;

# Interpret input using our record type
log_record: LogRecordProto = input;

# Split article into words
articlewords: WordType = sawzall(string(log_record.article), "[^+]+");

# Read table of divas from a file
static divas: array of string = sawzall(load("/home/szluser/divas"), ".+");

# count of all articles that mention any diva
articlecount: table sum[article: string] of int;

# count of how many times each diva appears
divacount: table sum[diva: string] of int;

# Count the relevant articles
when (
  somei: some int;
  somej: some int;
  lowercase(articlewords[somei]) == divas[somej]
)
  emit articlecount[log_record.article] <- 1;

# Count the divas individually
when (
  eachi: each int;
  eachj: each int;
  lowercase(articlewords[eachi]) == divas[eachj]
)
  emit divacount[divas[eachj]] <- 1;
A full scan of the list of divas for each input line would be expensive; the implementation could recognize that divas is a static table and construct an efficient lookup structure for use in this example.

Of course, not all pop stars have single-word names. We can rewrite the first pattern to handle two-word names such as christina aguilera:

# Count the relevant articles
when (
  somei: some int;
  somej: some int;
  lowercase(articlewords[somei] + " " + articlewords[somei+1]) == divas[somej]
)
  emit articlecount[divas[somej]] <- 1;
Because a pattern that indexes outside an array implicitly but silently fails, there is no need to prefix the pattern with a predicate to make sure the array has those elements. If it does not, the pattern will fail silently, as it should.

Some data files store their components as comma separated values with quoting rules, as in this example:

1966,4, "Mamas and the Papas", "Monday, Monday"
1968,9, """Mama Cass"" Elliot", "Dream a little dream"
In such cases one might write something like:
static songs: map[title: string] of artist: string =
  splitcsv(load("/home/google/top40.csv"), { 4, 3 });
If the CSV is being read on input, then one can split it apart like this:
fields: array of string = splitcsvline(input);