# SystemVerilog Lexer and Parser
<!--*
freshness: { owner: 'hzeller' reviewed: '2020-10-07' }
*-->
This directory contains the SystemVerilog lexer and parser implementations. The
goal for the parser is to be able to accept all valid SystemVerilog (IEEE
1800-2017), as defined in the [SV-LRM]. As of 2019, it accepts the vast majority
of SystemVerilog syntax, but there is work ahead to reach 100%. Progress towards
this goal is measured against open-source language-compliance tests at
https://symbiflow.github.io/sv-tests/.

Unlike conventional toolchains' parsers that expect preprocessed forms as input,
this parser accepts **unpreprocessed** code with some limitations. Thus,
preprocessing directives are accommodated in the
[implemented grammar](verilog.y).
## Decoupled Design
The lexer and parser are *decoupled*, which means that the lexer can be used
standalone to tokenize text, and the parser is adapted to accept tokens from
sources other than the direct use of the lexer. This separation enables the
insertion of different passes between the lexer and parser, such as integrated
preprocessing, and context-based lexical disambiguation (with arbitrary
lookahead) where required by the language.
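
The decoupling described above can be pictured with a small sketch. It is
illustrative only and does not use this directory's C++ API: `Token`,
`toy_lex`, `drop_spaces`, `toy_parse`, and `run_pipeline` are hypothetical
names, and the "parser" is a stand-in that merely collects token text.

```python
import re
from typing import Callable, Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    kind: str  # hypothetical token tag, e.g. "IDENT" or "SPACE"
    text: str


def toy_lex(text: str) -> Iterator[Token]:
    """Standalone lexer stage: usable without any parser."""
    pattern = r"(?P<SPACE>\s+)|(?P<IDENT>\w+)|(?P<SYMBOL>.)"
    for m in re.finditer(pattern, text):
        yield Token(m.lastgroup, m.group())


def drop_spaces(tokens: Iterable[Token]) -> Iterator[Token]:
    """An intermediate pass inserted between lexing and parsing."""
    return (t for t in tokens if t.kind != "SPACE")


def toy_parse(tokens: Iterable[Token]) -> List[str]:
    """Stand-in 'parser': consumes tokens from any iterable source."""
    return [t.text for t in tokens]


Pass = Callable[[Iterable[Token]], Iterable[Token]]


def run_pipeline(text: str, passes: List[Pass]) -> List[str]:
    """Chain lexer -> passes -> parser without the stages knowing each other."""
    stream: Iterable[Token] = toy_lex(text)
    for token_pass in passes:
        stream = token_pass(stream)
    return toy_parse(stream)


result = run_pipeline("wire w ;", [drop_spaces])
```

Because `toy_parse` only requires an iterable of tokens, it can just as well be
fed a hand-built list, which is the property that makes intermediate passes
possible.
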
## Lexer
The lexer is generated by [Flex]. Token enumerations come from the
[parser](verilog.y). The generated lexer implementation is wrapped in an
[adapter](verilog_lexer.h) that makes it return tokens (instead of just an
`int`). The stream of tokens returned by the [lexer](verilog_lexer.h) has the
following properties:
*   **Continuity**: The text range end of one token is equal to the text range
    start of the next token.
*   **Completeness**: The text range spanned by the first and last tokens is
    equal to that of the original text that was scanned.

This also means that insignificant whitespace (spaces, newlines) is represented
as tokens, and that non-syntax tokens such as comments are included. Such tokens
are easy to filter out before passing the stream on to the parsing phase.
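
As a rough illustration of these two properties, the sketch below tokenizes a
string with a toy regular expression and then checks continuity and
completeness. It is not the Flex-generated lexer: the token kinds, `tokenize`,
`is_continuous`, `is_complete`, and `significant` are all assumptions made for
this example.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str   # hypothetical token tag
    start: int  # offset of the first character
    end: int    # offset one past the last character


# Toy token rules; every character of the input matches exactly one rule.
TOKEN_RE = re.compile(
    r"(?P<SPACE>\s+)|(?P<COMMENT>//[^\n]*)|(?P<WORD>\w+)|(?P<SYMBOL>.)"
)


def tokenize(text: str) -> List[Token]:
    return [Token(m.lastgroup, m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]


def is_continuous(tokens: List[Token]) -> bool:
    """Each token ends exactly where the next one starts."""
    return all(a.end == b.start for a, b in zip(tokens, tokens[1:]))


def is_complete(tokens: List[Token], text: str) -> bool:
    """The first and last tokens together span the whole text."""
    if not tokens:
        return text == ""
    return tokens[0].start == 0 and tokens[-1].end == len(text)


def significant(tokens: List[Token]) -> List[Token]:
    """Filter out whitespace and comments before parsing."""
    return [t for t in tokens if t.kind not in ("SPACE", "COMMENT")]


source = "wire w;  // a net\n"
tokens = tokenize(source)
rebuilt = "".join(source[t.start:t.end] for t in tokens)
```

When both properties hold, concatenating the token text ranges reproduces the
original input, so no source text is lost even though the parser only sees the
filtered stream.
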
This lexer follows SystemVerilog lexical definitions, including that of the
preprocessing sub-language, because it is targeted at _unpreprocessed_ code.
We provide a
[standalone tool for examining tokens for any valid SV source file](../tools/syntax).
### Token Classifications
[Token classification functions](verilog_token_classifications.h) logically
group together sets of token enumerations, so that client code does not have to
repeat the same logic in different places.
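
The idea can be sketched as a handful of predicates over a token enumeration.
The enumeration values and function names below are hypothetical and do not
mirror the declarations in `verilog_token_classifications.h`.

```python
from enum import Enum, auto


class Tok(Enum):
    """Hypothetical token enumeration, for illustration only."""
    SPACE = auto()
    NEWLINE = auto()
    LINE_COMMENT = auto()
    BLOCK_COMMENT = auto()
    IDENTIFIER = auto()
    PP_DEFINE = auto()
    PP_IFDEF = auto()


# Each group is defined once, so callers never repeat the membership lists.
_WHITESPACE = frozenset({Tok.SPACE, Tok.NEWLINE})
_COMMENTS = frozenset({Tok.LINE_COMMENT, Tok.BLOCK_COMMENT})
_PREPROCESSOR = frozenset({Tok.PP_DEFINE, Tok.PP_IFDEF})


def is_whitespace(tok: Tok) -> bool:
    return tok in _WHITESPACE


def is_comment(tok: Tok) -> bool:
    return tok in _COMMENTS


def is_preprocessor_directive(tok: Tok) -> bool:
    return tok in _PREPROCESSOR


def is_parser_input(tok: Tok) -> bool:
    """Tokens that a filtering pass would forward to the parser."""
    return not (is_whitespace(tok) or is_comment(tok))


kept = [t for t in Tok if is_parser_input(t)]
```

Client code such as a formatter or a filter pass can then ask
`is_comment(tok)` instead of enumerating comment token kinds itself.
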
## Parser
The parser is generated by [Bison], an LALR(1) parser generator. The generated
parser implementation is wrapped in an [adapter](verilog_parser.h) that allows
it to work on tokens from any source, not just the lexer. This gives developers
the opportunity to insert filtering or transformation passes between the lexer
and parser.
The parser outputs a [concrete syntax tree (CST)](../CST) whose generic nodes
are "typed" using [enumerations](../CST/verilog_nonterminals.h).
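
A minimal sketch of what "generic nodes typed by an enumeration" can look like
follows. The classes `Node` and `Leaf`, the `NodeTag` values, and the helper
functions are hypothetical and are not the CST library's actual types.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Iterator, List, Union


class NodeTag(Enum):
    """Hypothetical nonterminal tags."""
    MODULE_DECLARATION = auto()
    NET_DECLARATION = auto()


@dataclass
class Leaf:
    text: str  # a token kept verbatim in the tree


@dataclass
class Node:
    tag: NodeTag  # the only thing distinguishing one node "type" from another
    children: List[Union["Node", Leaf]] = field(default_factory=list)


def find_all(root: Union[Node, Leaf], tag: NodeTag) -> Iterator[Node]:
    """Yield every node carrying the given tag, in pre-order."""
    if isinstance(root, Node):
        if root.tag is tag:
            yield root
        for child in root.children:
            yield from find_all(child, tag)


def leaf_texts(root: Union[Node, Leaf]) -> List[str]:
    """Collect token text left to right; a concrete tree keeps every token."""
    if isinstance(root, Leaf):
        return [root.text]
    out: List[str] = []
    for child in root.children:
        out.extend(leaf_texts(child))
    return out


tree = Node(NodeTag.MODULE_DECLARATION, [
    Leaf("module"), Leaf("m"), Leaf(";"),
    Node(NodeTag.NET_DECLARATION, [Leaf("wire"), Leaf("a"), Leaf(";")]),
    Node(NodeTag.NET_DECLARATION, [Leaf("wire"), Leaf("b"), Leaf(";")]),
    Leaf("endmodule"),
])
nets = list(find_all(tree, NodeTag.NET_DECLARATION))
```

Using one generic node class with a tag, rather than one class per grammar
rule, keeps traversal code uniform: the same `find_all` works for any
nonterminal.
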
We provide a
[standalone tool for examining the CST for any valid SV source file](../tools/syntax).
## Lexical Context
Parsing SystemVerilog is fraught with challenges that defy conventional LR
grammars. [LexicalContext](verilog_lexical_context.h) is a token transformation
pass that aims to help the Bison-generated parser implementation by:
1. Disambiguating tokens that are used in multiple syntactic contexts.
1. Performing long-distance lookahead that would otherwise not be expressible
   in an LR grammar.

It operates like a composition of state machines that scan and mutate tokens.
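
To make the state-machine idea concrete, here is a toy pass that retags an
ambiguous `->` token depending on whether it appears inside a `constraint`
block. It is a sketch under stated assumptions, not the algorithm in
`verilog_lexical_context.h`: the token kinds (`ARROW`, `CONSTRAINT_IMPLIES`,
`LOGICAL_IMPLIES`) and the function `retag_arrows` are hypothetical, and real
SystemVerilog needs far more context than brace counting.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text); kinds are hypothetical


def retag_arrows(tokens: Iterable[Token]) -> Iterator[Token]:
    """Scan tokens once, tracking context, and rewrite ambiguous kinds."""
    pending_constraint = False  # saw 'constraint', waiting for its '{'
    depth = 0                   # brace depth inside a constraint block
    for kind, text in tokens:
        if kind == "KEYWORD" and text == "constraint":
            pending_constraint = True
        elif text == "{":
            if pending_constraint:
                depth = 1
                pending_constraint = False
            elif depth > 0:
                depth += 1
        elif text == "}":
            if depth > 0:
                depth -= 1
        elif kind == "ARROW":
            # Same spelling, different grammar symbol depending on context.
            kind = "CONSTRAINT_IMPLIES" if depth > 0 else "LOGICAL_IMPLIES"
        yield kind, text


stream: List[Token] = [
    ("KEYWORD", "constraint"), ("IDENT", "c"), ("SYMBOL", "{"),
    ("IDENT", "a"), ("ARROW", "->"), ("IDENT", "b"), ("SYMBOL", ";"),
    ("SYMBOL", "}"),
    ("IDENT", "x"), ("SYMBOL", "="),
    ("IDENT", "a"), ("ARROW", "->"), ("IDENT", "b"), ("SYMBOL", ";"),
]
retagged = list(retag_arrows(stream))
arrow_kinds = [kind for kind, text in retagged if text == "->"]
```

The pass changes only token kinds, never token text, so a downstream LR parser
can treat the two arrows as distinct terminals without needing extra lookahead
of its own.
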
<!-- reference links -->
[SV-LRM]: https://ieeexplore.ieee.org/document/8299595
[Bison]: https://www.gnu.org/software/bison/
[Flex]: https://www.gnu.org/software/flex/