Stacks of Tokens
A study in interfaces
Sydney Go Meetup
15 September 2016
Rob Pike
Sydney Go Meetup
15 September 2016
Rob Pike
Spoke at Gophercon this year: go.dev/talks/2016/asm.slide
That talk was about system design and portability.
Today's talk is about its lexer.
Spoke about lexing before: go.dev/talks/2011/lex.slide
That talk showed a way to use concurrency to build a lexer.
Today's talk is about interfaces.
2A lexer, also called a scanner, breaks the input stream into distinct "tokens":
Each token has a type and a value.
3MOVQ R0, 4(R1)
The lexer turns this into the stream
MOVQ
R0
,
4
(
R1
)
Spaces are ignored.
The parser then reads these tokens to parse the input into a parse tree.
4There is a nice, efficient lexer package in the Go standard library:
It can do this job just fine.
But.... that is not enough for the assembler because of
5The new Go assembler had to be totally compatible with the ones it replaces, which used YACC and were written in C. (See https://go.dev/talks/2015/gogo.slide.)
Each assembler (one per architecture) contained these lines at the end of lex.c
:
#include "../cc/lexbody" #include "../cc/macbody"
This gave the assemblers the same lexer as the C compiler.
The differences between C tokens and Go tokens are minor and can be handled, but....
The C lexer brings in something problematic.
6
The old assemblers had a C preprocessor built in!
An old-fashioned one, without #if
and token pasting, but still:
#include "file" #define MAXBUF 512 #define MULDIV(a, b, c) ((a)*(b)/(c)) #ifdef MAXBUF #endif
The text/scanner
package can't handle this.
But we need to do it to be compatible.
This talk is about how to use Go's interfaces to do it.
7What is standard input? An input source.
What is an included file? An input source.
What is a macro invocation? An input source.
Sounds a lot like io.Reader
.
We don't want bytes, we want tokens. (Why?)
Instead of
type Reader interface { Read(p []byte) (n int, err error) }
we want something like
type TokenReader interface { ReadToken() (Token, error) }
In practice the parser needs something different from Read
, but the basic idea works.
We build a lexer around an interface that reads tokens.
9What is standard input? An input source.
What is an included file? An input source.
What is a macro invocation? An input source.
Each of these implements the TokenReader
interface.
type TokenReader interface { // Next returns the next token. Next() ScanToken // The following methods all refer to the most recent token returned by Next. // Text returns the original string representation of the token. Text() string // File reports the source file name of the token. File() string // Line reports the source line number of the token. Line() int // Col reports the source column number of the token. Col() int }
Parser calls Next
, then can ask about the token: what is, where it is.
ScanToken
is just text/scanner.Token
with tweaks.
Note: No Peek
. This has no lookahead.
Tokenizer
, the foundational TokenReader
, turns a text/scanner.Scanner
into a TokenReader
.
// A Tokenizer is a simple wrapping of text/scanner.Scanner, configured // for our purposes and made a TokenReader. It forms the lowest level, // turning text from readers into tokens. type Tokenizer struct { tok ScanToken // Most recent token. s *scanner.Scanner line int fileName string } func NewTokenizer(name string, r io.Reader, file *os.File) *Tokenizer
Either the reader or the file may be nil.
Tokenizer
implements TokenReader
To give the flavor:
func (t *Tokenizer) Next() ScanToken { s := t.s for { t.tok = ScanToken(s.Scan()) if t.tok != scanner.Comment { break } length := strings.Count(s.TokenText(), "\n") t.line += length histLine += length // For now, just discard all comments. } // Special processing for '\n' etc. elided. return t.tok }
func (t *Tokenizer) Text() string { switch t.tok { case LSH: // Special handling of odd tokens used by ARM. return "<<" case RSH: return ">>" case ARR: return "->" case ROT: return "@>" } return t.s.TokenText() }
LSH
etc. are the reason for ScanToken
: the set of tokens is a superset of the underlying type text/scanner.Token
.
It's easy to see how files work: NewTokenizer
can do that.
What about a macro definition?
#define A(arg) 27+(arg)
Becomes the tokens
27 + ( arg )
When we encounter A(x)
, we substitute the argument and get
27 + ( x )
Use a slice of tokens and store them in a Macro
struct.
type Macro struct { name string // The #define name. args []string // Formal arguments. tokens []Token // Body of macro. }
// A Slice reads from a slice of Tokens. type Slice struct { tokens []Token fileName string line int pos int }
Implements TokenReader
.
func (s *Slice) Next() ScanToken { s.pos++ if s.pos >= len(s.tokens) { return scanner.EOF } return s.tokens[s.pos].ScanToken }
To invoke a macro, substitute the actuals for the formals and make a Slice
.
A command-line flag -D
can define a macro before execution:
go tool asm -D 'macro=value' file.s
That's easy!
var DFlag MultiFlag flag.Var(&DFlag, "D", "predefined symbol D=identifier...") type MultiFlag []string // Implements flag.Value, allows multiple settings. predefine(DFlag)
// predefine installs the macros set by the -D flag on the command line. func predefine(defines MultiFlag) map[string]*Macro { macros := make(map[string]*Macro) for _, name := range defines { value := "1" i := strings.IndexRune(name, '=') if i > 0 { name, value = name[:i], name[i+1:] } // Various error checks elided. macros[name] = &Macro{ name: name, args: nil, // No arguments allowed. tokens: Tokenize(value), // Turn the value into tokens. } } return macros }
The return value is the initial symbol table of macros.
18We know how to:
io.Reader
But how does it fit together?
Which implementation TokenReader
does the parser see?
Consider
#include
names a file to process nextIt's a stack! Push new input, pop at EOF.
19// A Stack is a stack of TokenReaders. As the top TokenReader hits EOF, // it resumes reading the next one down. type Stack struct { tr []TokenReader } // Push adds tr to the top (end) of the input stack. (Popping happens automatically.) func (s *Stack) Push(tr TokenReader) { s.tr = append(s.tr, tr) } func (s *Stack) Next() ScanToken { tos := s.tr[len(s.tr)-1] tok := tos.Next() for tok == scanner.EOF && len(s.tr) > 1 { // Pop the topmost item from the stack and resume with the next one down. s.tr = s.tr[:len(s.tr)-1] tok = s.Next() } return tok }
// Input is the main input: a stack of readers and some macro definitions. // It also handles #include processing (by pushing onto the input stack) // and parses and instantiates macro definitions. type Input struct { Stack includes []string // Directories in which to look for #includes macros map[string]*Macro text string // Text of last token returned by Next. ... }
Note the embedding of Stack
: Input
is a wrapping of the Stack
implementation of TokenReader
.
The parser uses a single instance of Input
as its TokenReader
.
Some error handling elided for brevity.
func (in *Input) include() { // Find and parse file name, which is next token, a string. tok := in.Stack.Next() name, _ := strconv.Unquote(in.Stack.Text()) in.expectNewline("#include") // Checks that a newline comes now. // Push tokenizer for file onto stack. fd, err := os.Open(name) if err != nil { for _, dir := range in.includes { fd, err = os.Open(filepath.Join(dir, name)) if err == nil { break } } } in.Push(NewTokenizer(name, fd, fd)) }
Macro definition is similar but more complex because of the variety of forms.
Must deal with constants, empty values, macros with arguments, etc.
The end result is to build a Macro
value and install it in Input.macros
.
Here is the implementation of a CPP input stream using these types.
(Error handling mostly elided for brevity.)
func (in *Input) Next() ScanToken { // If we cannot generate a token after 100 macro invocations, we're in trouble. for nesting := 0; nesting < 100; { tok := in.Stack.Next() switch tok { case '#': in.hash() case scanner.Ident: // Is it a macro name? name := in.Stack.Text() macro := in.macros[name] if macro != nil { nesting++ in.invokeMacro(macro) continue } fallthrough default: // Continued on next slide.
func (in *Input) Next() ScanToken { // Continued from previous slide. default: if tok == scanner.EOF && len(in.ifdefStack) > 0 { // We're skipping text but have run out of input with no #endif. in.Error("unclosed #ifdef or #ifndef") } if in.enabled() { in.text = in.Stack.Text() return tok } } } in.Error("recursive macro invocation") return 0 }
// NewInput returns an Input from the given path. func NewInput(name string) *Input { return &Input{ // include directories: look in source dir, then -I directories. includes: append([]string{filepath.Dir(name)}, IFlag...), macros: predefine(DFlag), } }
To run, call in.Push
to put the input file (or os.Stdin
) on the stack.
Then the lexer runs until the Stack
is empty.
Interfaces give programs structure.
Interfaces encourage design by composition.
Interfaces enable and enforce clean divisions between components.
TokenReader
let us implement #include
files, #define
macros, command-line flags, #ifdef
and more with one simple interface.
And a final tip of the hat to text/scanner
under it all.
Sydney Go Meetup
15 September 2016