| # SystemVerilog Lexer and Parser |
| |
| <!--* |
| freshness: { owner: 'hzeller' reviewed: '2020-10-07' } |
| *--> |
| |
| This directory contains the SystemVerilog lexer and parser implementations. The |
| goal for the parser is to be able to accept all valid SystemVerilog (IEEE |
| 1800-2017), as defined in the [SV-LRM]. As of 2019, it accepts the vast majority |
| of SystemVerilog syntax, but there is work ahead to reach 100%. Progress towards |
| this goal is measured against open-source language-compliance tests at |
| https://symbiflow.github.io/sv-tests/. |
| |
| Unlike conventional toolchains' parsers that expect preprocessed forms as input, |
| this parser accepts **unpreprocessed** code with some limitations. Thus, |
| preprocessing directives are accommodated in the |
| [implemented grammar](verilog.y). |
| |
| ## Decoupled Design |
| |
| The lexer and parser are *decoupled*, which means that the lexer can be used |
| standalone to tokenize text, and the parser is adapted to accept tokens from |
| sources other than the direct use of the lexer. This separation enables the |
| insertion of different passes between the lexer and parser, such as integrated |
| preprocessing, and context-based lexical disambiguation (with arbitrary |
| lookahead) where required by the language. |
| |
| ## Lexer |
| |
| The lexer is generated by [Flex]. Token enumerations come from the |
| [parser](verilog.y). The generated lexer implementation is wrapped in an |
| [adapter](verilog_lexer.h) that makes it return tokens (instead of just an |
| `int`). The stream of tokens returned by the [lexer](verilog_lexer.h) has the |
| following properties: |
| |
| * **Continuity**: The text range end of one token is equal to the text range |
| start of the next token. |
| * **Completeness**: The text range spanned by the first and last tokens is |
| equal to that of the original text that was scanned. |
| |
| This also means that insignificant whitespace (spaces, newlines) is represented |
| as tokens, and non-syntax tokens such as comments are included. Such tokens are |
| easy to filter out before passing them on to the parsing phase. |
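
The two properties above, and the filtering step, can be sketched in a few lines of C++. This is a minimal illustration only: `Kind`, `Token`, `IsContinuous`, `IsComplete`, and `FilterForParser` are hypothetical names invented for this example and are not the types or functions declared in `verilog_lexer.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical token kinds; the real enumerations come from the grammar.
enum class Kind { kIdentifier, kSymbol, kWhitespace, kComment };

// Hypothetical token: a kind plus a half-open byte range [begin, end)
// into the scanned text.
struct Token {
  Kind kind;
  std::size_t begin;
  std::size_t end;
};

// Continuity: each token starts exactly where the previous one ended.
bool IsContinuous(const std::vector<Token>& tokens) {
  for (std::size_t i = 1; i < tokens.size(); ++i) {
    assert(tokens[i].begin <= tokens[i].end);  // ranges are never inverted
    if (tokens[i - 1].end != tokens[i].begin) return false;
  }
  return true;
}

// Completeness: the first and last tokens together span the whole text.
bool IsComplete(const std::vector<Token>& tokens, std::string_view text) {
  if (tokens.empty()) return text.empty();
  return tokens.front().begin == 0 && tokens.back().end == text.size();
}

// Drops whitespace and comments, keeping only tokens a parser would consume.
std::vector<Token> FilterForParser(const std::vector<Token>& tokens) {
  std::vector<Token> kept;
  for (const Token& t : tokens) {
    if (t.kind != Kind::kWhitespace && t.kind != Kind::kComment) {
      kept.push_back(t);
    }
  }
  return kept;
}
```

Because the unfiltered stream covers every byte of the input, a tool that needs the original text (a formatter, for instance) can work from the full stream, while the parser sees only the filtered one.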
| |
| This lexer follows SystemVerilog lexical definitions, including that of the |
| preprocessing sub-language, because it is targeted at _unpreprocessed_ code. |
| |
| We provide a |
| [standalone tool for examining tokens for any valid SV source file](../tools/syntax). |
| |
| ### Token Classifications |
| |
| The [token classifications header](verilog_token_classifications.h) provides |
| functions that logically group together sets of token enumerations, so that |
| client code does not have to repeat the same logic in different places. |
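
As a rough idea of what such grouping functions look like, here is a small sketch. The enumeration and function names are assumptions made up for this example; consult `verilog_token_classifications.h` for the functions that actually exist.

```cpp
#include <cassert>

// Hypothetical subset of token enumerations (illustrative names only).
enum class TokenEnum {
  kIdentifier,
  kSpace,
  kNewline,
  kLineComment,
  kBlockComment,
  kPPIfdef,
  kPPIfndef,
  kPPElse,
  kPPElsif,
  kPPEndif,
};

// Groups all whitespace-like enumerations behind one predicate.
constexpr bool IsWhitespace(TokenEnum e) {
  switch (e) {
    case TokenEnum::kSpace:
    case TokenEnum::kNewline:
      return true;
    default:
      return false;
  }
}

// Groups both comment styles behind one predicate.
constexpr bool IsComment(TokenEnum e) {
  switch (e) {
    case TokenEnum::kLineComment:
    case TokenEnum::kBlockComment:
      return true;
    default:
      return false;
  }
}

// Groups the conditional-compilation directives behind one predicate.
constexpr bool IsPreprocessorControlFlow(TokenEnum e) {
  switch (e) {
    case TokenEnum::kPPIfdef:
    case TokenEnum::kPPIfndef:
    case TokenEnum::kPPElse:
    case TokenEnum::kPPElsif:
    case TokenEnum::kPPEndif:
      return true;
    default:
      return false;
  }
}
```

Client code then asks `IsComment(kind)` instead of listing each enumeration, so adding a new token kind means updating one function rather than every call site.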
| |
| ## Parser |
| |
| The parser is generated by [Bison], an LALR(1) parser generator. The generated |
| parser implementation is wrapped in an [adapter](verilog_parser.h) that allows |
| it to work on tokens from any source, not just the lexer. This gives developers |
| the opportunity to insert filtering or transformation passes between the lexer |
| and parser. |
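
One way to picture this arrangement is to treat a token source as any callable that yields the next token, so that passes can be stacked in front of the consumer. The sketch below assumes that model; `TokenSource`, `FromVector`, `SkipWhitespace`, and `CountTokens` are hypothetical names for illustration and do not reflect the interface in `verilog_parser.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical token kinds and token type for this sketch.
enum class Kind { kIdentifier, kWhitespace, kEndOfInput };
struct Token {
  Kind kind;
};

// A token source is anything that yields the next token on each call.
using TokenSource = std::function<Token()>;

// Source backed by an already-lexed vector; yields kEndOfInput when drained.
TokenSource FromVector(std::vector<Token> tokens) {
  std::size_t next = 0;
  return [tokens = std::move(tokens), next]() mutable -> Token {
    if (next >= tokens.size()) return Token{Kind::kEndOfInput};
    return tokens[next++];
  };
}

// Filtering pass: wraps another source and hides whitespace tokens.
TokenSource SkipWhitespace(TokenSource upstream) {
  return [upstream = std::move(upstream)]() mutable -> Token {
    Token t = upstream();
    while (t.kind == Kind::kWhitespace) t = upstream();
    return t;
  };
}

// Stand-in for a parser: consumes tokens until end of input and counts them.
std::size_t CountTokens(TokenSource source) {
  std::size_t n = 0;
  while (source().kind != Kind::kEndOfInput) ++n;
  return n;
}
```

The consumer does not know whether its tokens come straight from a lexer, from a stored vector, or through a chain of passes; it only calls the source until it sees the end-of-input token.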
| |
| The parser outputs a [concrete syntax tree (CST)](../CST) whose generic nodes |
| are "typed" using [enumerations](../CST/verilog_nonterminals.h). |
| |
| We provide a |
| [standalone tool for examining the CST for any valid SV source file](../tools/syntax). |
| |
| ## Lexical Context |
| |
| Parsing SystemVerilog is fraught with challenges that defy conventional LR |
| grammars. [LexicalContext](verilog_lexical_context.h) is a token transformation |
| pass that aims to help the Bison-generated parser implementation by: |
| |
| 1. Disambiguating tokens that are used in multiple syntactic contexts. |
| 1. Performing long-distance lookahead that would otherwise not be expressible in an |
| LR grammar. |
| |
| It operates like a composition of state-machines that scan and mutate tokens. |
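
To make the idea concrete, here is a deliberately simplified sketch of one such state machine. It assumes a single ambiguous token, `->`, which is relabeled depending on whether it appears inside a `constraint` block. The class, the token kinds, and the two-way split are assumptions chosen for illustration; the real pass in `verilog_lexical_context.h` tracks far more context than this.

```cpp
#include <cassert>
#include <vector>

// Hypothetical token kinds. kArrow is the ambiguous "->" as lexed;
// the two kinds after it are the disambiguated forms.
enum class Kind {
  kConstraint,  // the `constraint` keyword
  kOpenBrace,
  kCloseBrace,
  kArrow,
  kConstraintImplies,
  kEventTrigger,
  kOther,
};

struct Token {
  Kind kind;
};

// Small state machine: tracks whether the scan is inside a constraint
// block and rewrites "->" tokens accordingly.
class ArrowDisambiguator {
 public:
  // Updates internal state from the token and may rewrite its kind in place.
  void Transform(Token& t) {
    switch (t.kind) {
      case Kind::kConstraint:
        saw_constraint_keyword_ = true;
        break;
      case Kind::kOpenBrace:
        // Only braces that open, or nest inside, a constraint block count.
        if (saw_constraint_keyword_ || depth_ > 0) ++depth_;
        saw_constraint_keyword_ = false;
        break;
      case Kind::kCloseBrace:
        if (depth_ > 0) --depth_;
        break;
      case Kind::kArrow:
        t.kind = depth_ > 0 ? Kind::kConstraintImplies : Kind::kEventTrigger;
        break;
      default:
        break;
    }
    assert(depth_ >= 0);
  }

 private:
  bool saw_constraint_keyword_ = false;
  int depth_ = 0;  // brace nesting depth inside a constraint block
};

// Runs the pass over a whole token sequence.
void Disambiguate(std::vector<Token>& tokens) {
  ArrowDisambiguator pass;
  for (Token& t : tokens) pass.Transform(t);
}
```

After this pass, the grammar can use two distinct terminals instead of one overloaded `->`, which removes a conflict that an LALR(1) parser could not resolve with a single token of lookahead.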
| |
| <!-- reference links --> |
| |
| [SV-LRM]: https://ieeexplore.ieee.org/document/8299595 |
| [Bison]: https://www.gnu.org/software/bison/ |
| [Flex]: https://www.gnu.org/software/flex/ |