Lexical Scanning in Go
GTUG Sydney
30 August 2011
Rob Pike
A video of this talk was recorded at the Go Sydney Meetup.
Many programming problems realign one data structure to fit another structure.
The pieces on either side have independent state, lookahead, buffers, ...
Can be messy to do well.
Coroutines were invented to solve this problem!
They enable us to write the two pieces independently.
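As a minimal sketch of that idea (not from the talk), here are two pieces written independently as straight-line code, connected by a channel:

package main

import "fmt"

// produce is one piece: it keeps its own state (the loop
// variable) implicitly, with no explicit state machine.
func produce(out chan<- int) {
    for i := 0; i < 3; i++ {
        out <- i * i
    }
    close(out)
}

func main() {
    ch := make(chan int)
    go produce(ch) // run the producer concurrently
    for v := range ch {
        fmt.Println(v) // the consumer is straight-line code too
    }
}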
Let's look at this topic in the context of a lexer.
Wanted to replace the old Go template package.
It had many problems.
Key change was re-engineering with a true lexer, parser, and evaluator.
Has arbitrary text plus actions in {{ }}.
Evaluation: {{.Title}}
Constants and functions: {{printf "%g: %#3X" 1.2+2i 123}}
Control structures: {{range $s.Text}} {{.}} {{end}}
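As a hypothetical aside, actions like these still work in today's text/template package, the descendant of the one described here:

package main

import (
    "os"
    "text/template" // modern descendant of exp/template
)

func main() {
    const t = "{{.Title}}: {{range .Text}}{{.}} {{end}}\n"
    data := struct {
        Title string
        Text  []string
    }{"Lexing", []string{"state", "action", "token"}}
    tmpl := template.Must(template.New("demo").Parse(t))
    if err := tmpl.Execute(os.Stdout, data); err != nil {
        panic(err)
    }
}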
Must tokenize the plain text, the delimiters, and the elements of the actions.
Two things identify each lexed item:
// item represents a token returned from the scanner.
type item struct {
    typ itemType // Type, such as itemNumber.
    val string   // Value, such as "23.2".
}
The type is just an integer constant. We use iota to define the values.
// itemType identifies the type of lex items.
type itemType int

const (
    itemError itemType = iota // error occurred;
                              // value is text of error
    itemDot                   // the cursor, spelled '.'
    itemEOF
    itemElse       // else keyword
    itemEnd        // end keyword
    itemField      // identifier, starting with '.'
    itemIdentifier // identifier
    itemIf         // if keyword
    itemLeftMeta   // left meta-string
    itemNumber     // number
    itemPipe       // pipe symbol
    itemRange      // range keyword
    itemRawString  // raw quoted string (includes quotes)
    itemRightMeta  // right meta-string
    itemString     // quoted string (includes quotes)
    itemText       // plain text
)
Printf has a convention making it easy to print any type: just define a String() method:
func (i item) String() string {
    switch i.typ {
    case itemEOF:
        return "EOF"
    case itemError:
        return i.val
    }
    if len(i.val) > 10 {
        return fmt.Sprintf("%.10q...", i.val)
    }
    return fmt.Sprintf("%q", i.val)
}
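With the String method defined, the fmt package finds it automatically; a tiny hypothetical demo:

// Hypothetical: fmt calls item's String method for us.
fmt.Println(item{itemNumber, "23.2"})              // prints "23.2"
fmt.Println(item{itemText, "some very long text"}) // prints "some very "...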
Many approaches to lexing are available, from tools to hand-written code. There's nothing wrong with using a tool, but it's easy to write a lexer by hand. (Blogged about this last week.)
Plus, most programming languages lex pretty much the same tokens, so once we learn how, it's trivial to adapt the lexer for the next purpose.
Many people will tell you to write a switch statement, something like this:
// One iteration:
switch state {
case state1:
    state = action1()
case state2:
    state = action2()
case state3:
    state = action3()
}
Boring and repetitive and error-prone, but anyway:
Why switch?
After each action, you know where you want to be;
the new state is the result of the action.
But we throw the info away and recompute it from the state.
(A consequence of returning to the caller.)
A tool can compile that out, but so can we.
State represents where we are and what we expect.
Action represents what we do.
Actions result in a new state.
Let's put them together: a state function.
Executes an action, returns the next state—as a state function.
Recursive definition but simple and clear.
// stateFn represents the state of the scanner
// as a function that returns the next state.
type stateFn func(*lexer) stateFn
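To show the shape, here is a toy state function (hypothetical; not one of the talk's states) that would skip spaces and pick the next state:

// lexSpaces is a made-up example of the pattern:
// do a little work, then return the next state.
func lexSpaces(l *lexer) stateFn {
    for isSpace(l.peek()) {
        l.next()
    }
    l.ignore()             // spaces aren't tokens
    return lexInsideAction // next state, as a function
}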
Our state machine is trivial: just run until the state goes to nil, representing "done".
// run lexes the input by executing state functions
// until the state is nil.
func run() {
    for state := startState; state != nil; {
        state = state(lexer)
    }
}
How do we make tokens available to the client?
Tokens can emerge at times when it is inconvenient to stop and return to the caller.
Use concurrency: run the state machine as a goroutine, and emit values on a channel.
Here is the lexer type. Notice the channel of items; ignore the rest for now.
// lexer holds the state of the scanner.
type lexer struct {
    name  string    // used only for error reports.
    input string    // the string being scanned.
    start int       // start position of this item.
    pos   int       // current position in the input.
    width int       // width of last rune read from input.
    items chan item // channel of scanned items.
}
A lexer initializes itself to lex a string and launches the state machine as a goroutine, returning the lexer itself and a channel of items. The API will change; don't worry about it now.
func lex(name, input string) (*lexer, chan item) {
    l := &lexer{
        name:  name,
        input: input,
        items: make(chan item),
    }
    go l.run() // Concurrently run state machine.
    return l, l.items
}
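A caller might consume this version like so (hypothetical driver; the run function, shown next, closes the channel, so the range loop terminates):

// Hypothetical driver for the channel-based API.
_, items := lex("demo", "hello, {{.Name}}")
for tok := range items {
    fmt.Println(tok)
}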
Here's the real state machine run function, which runs as a goroutine.
// run lexes the input by executing state functions until
// the state is nil.
func (l *lexer) run() {
    for state := lexText; state != nil; {
        state = state(l)
    }
    close(l.items) // No more tokens will be delivered.
}
A token is a type and a value, but (yay Go) the value can just be sliced from the input string.
The lexer remembers where it is in the input and the emit routine just lobs that substring to the caller as the token's value.
input string // the string being scanned.
start int    // start position of this item.
pos   int    // current position in the input.
// emit passes an item back to the client.
func (l *lexer) emit(t itemType) {
    l.items <- item{t, l.input[l.start:l.pos]}
    l.start = l.pos
}
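A tiny hand-driven example (hypothetical values) of how emit slices the input:

// Pretend lexText has just consumed "hi" from "hi{{".
l := &lexer{input: "hi{{", items: make(chan item, 1)}
l.pos = 2              // just before the left meta
l.emit(itemText)       // sends item{itemText, "hi"}; start becomes 2
fmt.Println(<-l.items) // prints "hi"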
As the lexer begins it's looking for plain text, so the initial state is the function lexText.
It absorbs plain text until a "left meta" is encountered.
const leftMeta = "{{"

func lexText(l *lexer) stateFn {
    for {
        if strings.HasPrefix(l.input[l.pos:], leftMeta) {
            if l.pos > l.start {
                l.emit(itemText)
            }
            return lexLeftMeta // Next state.
        }
        if l.next() == eof {
            break
        }
    }
    // Correctly reached EOF.
    if l.pos > l.start {
        l.emit(itemText)
    }
    l.emit(itemEOF) // Useful to make EOF a token.
    return nil      // Stop the run loop.
}
A trivial state function. When we get here, we know there's a leftMeta in the input.
func lexLeftMeta(l *lexer) stateFn {
    l.pos += len(leftMeta)
    l.emit(itemLeftMeta)
    return lexInsideAction // Now inside {{ }}.
}
This will give you the flavor.

func lexInsideAction(l *lexer) stateFn {
    // Either number, quoted string, or identifier.
    // Spaces separate and are ignored.
    // Pipe symbols separate and are emitted.
    for {
        if strings.HasPrefix(l.input[l.pos:], rightMeta) {
            return lexRightMeta
        }
        switch r := l.next(); {
        case r == eof || r == '\n':
            return l.errorf("unclosed action")
        case isSpace(r):
            l.ignore()
        case r == '|':
            l.emit(itemPipe)
        case r == '"':
            return lexQuote
        case r == '`':
            return lexRawQuote
        case r == '+' || r == '-' || '0' <= r && r <= '9':
            l.backup()
            return lexNumber
        case isAlphaNumeric(r):
            l.backup()
            return lexIdentifier
        }
    }
}
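The talk doesn't show lexRightMeta (or lexQuote and friends), but by symmetry with lexLeftMeta it would be something like this sketch, assuming rightMeta is "}}":

const rightMeta = "}}" // assumed counterpart of leftMeta

// Sketch, not from the talk: the mirror image of lexLeftMeta.
func lexRightMeta(l *lexer) stateFn {
    l.pos += len(rightMeta)
    l.emit(itemRightMeta)
    return lexText // back to scanning plain text
}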
// next returns the next rune in the input.
func (l *lexer) next() (rune int) {
    if l.pos >= len(l.input) {
        l.width = 0
        return eof
    }
    rune, l.width = utf8.DecodeRuneInString(l.input[l.pos:])
    l.pos += l.width
    return rune
}
// ignore skips over the pending input before this point.
func (l *lexer) ignore() {
    l.start = l.pos
}
// backup steps back one rune.
// Can be called only once per call of next.
func (l *lexer) backup() {
    l.pos -= l.width
}
// peek returns but does not consume
// the next rune in the input.
func (l *lexer) peek() int {
    rune := l.next()
    l.backup()
    return rune
}
// accept consumes the next rune
// if it's from the valid set.
func (l *lexer) accept(valid string) bool {
    if strings.IndexRune(valid, l.next()) >= 0 {
        return true
    }
    l.backup()
    return false
}
// acceptRun consumes a run of runes from the valid set.
func (l *lexer) acceptRun(valid string) {
    for strings.IndexRune(valid, l.next()) >= 0 {
    }
    l.backup()
}
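The code also leans on isSpace and isAlphaNumeric, which the talk never shows. Plausible definitions (assumptions, in the talk's pre-Go 1 style where runes are ints):

// Assumed helpers, not shown in the talk.
func isSpace(r int) bool {
    return r == ' ' || r == '\t'
}

func isAlphaNumeric(r int) bool {
    return r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r)
}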
func lexNumber(l *lexer) stateFn {
    // Optional leading sign.
    l.accept("+-")
    // Is it hex?
    digits := "0123456789"
    if l.accept("0") && l.accept("xX") {
        digits = "0123456789abcdefABCDEF"
    }
    l.acceptRun(digits)
    if l.accept(".") {
        l.acceptRun(digits)
    }
    if l.accept("eE") {
        l.accept("+-")
        l.acceptRun("0123456789")
    }
    // Is it imaginary?
    l.accept("i")
    // Next thing mustn't be alphanumeric.
    if isAlphaNumeric(l.peek()) {
        l.next()
        return l.errorf("bad number syntax: %q",
            l.input[l.start:l.pos])
    }
    l.emit(itemNumber)
    return lexInsideAction
}

This is more accepting than it should be, but not by much. The caller must call Atof to validate.
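In today's Go, that validation might use strconv's descendants of Atof (a hypothetical sketch; imaginary constants like 2i would need their own case):

// Hypothetical caller-side validation.
func validNumber(s string) bool {
    if _, err := strconv.ParseInt(s, 0, 64); err == nil {
        return true // integers, including 0x... hex
    }
    _, err := strconv.ParseFloat(s, 64)
    return err == nil
}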
Easy to handle: emit the bad token and shut down the machine.
// error returns an error token and terminates the scan
// by passing back a nil pointer that will be the next
// state, terminating l.run.
func (l *lexer) errorf(format string, args ...interface{}) stateFn {
    l.items <- item{
        itemError,
        fmt.Sprintf(format, args...),
    }
    return nil
}
Concurrency makes the lexer easy to design.
Goroutines allow lexer and caller (parser) each to run at its own rate, as clean sequential code.
Channels give us a clean way to emit tokens.
Can't run a goroutine to completion during initialization.
Forbidden by the language specification.
(Raises awful issues about order of init, best avoided.)
That means we can't lex & parse a template during init.
The goroutine is a problem....
(Note: This restriction was lifted in Go version 1 but the discussion is still interesting.)
…but it's not necessary anyway.
The work is done by the design; now we just adjust the API.
We can change the API to hide the channel, provide a function to get the next token, and rewrite the run function.
It's easy.
Hide the channel and buffer it slightly, turning it into a ring buffer.
// lex creates a new scanner for the input string.
func lex(name, input string) *lexer {
    l := &lexer{
        name:  name,
        input: input,
        state: lexText,
        items: make(chan item, 2), // Two items sufficient.
    }
    return l
}
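For this version the lexer struct must also record the machine's current state; the updated struct isn't shown in the talk, but presumably it gains one field:

// lexer as before, plus a state field (assumed).
type lexer struct {
    name  string    // used only for error reports.
    input string    // the string being scanned.
    start int       // start position of this item.
    pos   int       // current position in the input.
    width int       // width of last rune read from input.
    state stateFn   // the next lexing function to enter.
    items chan item // buffered channel of scanned items.
}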
Traditional lexer API: return next item.
Includes the modified state machine runner.
// nextItem returns the next item from the input.
func (l *lexer) nextItem() item {
    for {
        select {
        case item := <-l.items:
            return item
        default:
            l.state = l.state(l)
        }
    }
    panic("not reached")
}
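Driving the final API might look like this (hypothetical usage):

// Hypothetical driver for the traditional API.
l := lex("demo", "hello, {{.Name}}")
for {
    tok := l.nextItem()
    fmt.Println(tok)
    if tok.typ == itemEOF || tok.typ == itemError {
        break
    }
}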
We now have a traditional API for a lexer with a simple, concurrent implementation under the covers.
Even though the implementation is no longer truly concurrent, it still has all the advantages of concurrent design.
We wouldn't have such a clean, efficient design if we hadn't thought about the problem concurrently, without worrying about "restart". The model completely removes concerns about "structural mismatch".
Concurrency is not about parallelism.
(Although it can enable parallelism).
Concurrency is a way to design a program by decomposing it into independently executing pieces.
The result can be clean, efficient, and very adaptable.
Lexers are fun.
Concurrency is fun.
Go is fun.
Go: golang.org
New templates: http://golang.org/pkg/exp/template/
(Next release will move them out of experimental.)
Rob Pike