
 # SystemVerilog Lexer and Parser

 <!--*
 freshness: { owner: 'hzeller' reviewed: '2020-10-07' }
 *-->

 This directory contains the SystemVerilog lexer and parser implementations. The
 goal for the parser is to be able to accept all valid SystemVerilog (IEEE
 1800-2017), as defined in the [SV-LRM]. As of 2019, it accepts the vast majority
 of SystemVerilog syntax, but there is work ahead to reach 100%. Progress towards
 this goal is measured against open-source language-compliance tests at
 https://symbiflow.github.io/sv-tests/.

 Unlike conventional toolchains' parsers that expect preprocessed forms as input,
 this parser accepts **unpreprocessed** code with some limitations. Thus,
 preprocessing directives are accommodated in the
 [implemented grammar](verilog.y).

 ## Decoupled Design

 The lexer and parser are *decoupled*, which means that the lexer can be used
 standalone to tokenize text, and the parser is adapted to accept tokens from
 sources other than the direct use of the lexer. This separation enables the
 insertion of different passes between the lexer and parser, such as integrated
 preprocessing, and context-based lexical disambiguation (with arbitrary
 lookahead) where required by the language.

 ## Lexer

 The lexer is generated by [Flex]. Token enumerations come from the
 [parser](verilog.y). The generated lexer implementation is wrapped in an
 [adapter](verilog_lexer.h) that makes it return tokens (instead of just an
 `int`). The stream of tokens returned by the [lexer](verilog_lexer.h) has the
 following properties:

 *   **Continuity**: The text range end of one token is equal to the text range
     start of the next token.
 *   **Completeness**: The text range spanned by the first and last tokens is
     equal to that of the original text that was scanned.

 This also means that insignificant whitespace (spaces, newlines) is represented
 as tokens, and non-syntax tokens such as comments are included. Such tokens are
 easy to filter out before passing the stream on to the parsing phase.
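The two properties above can be illustrated with a toy tokenizer. The following is a minimal conceptual sketch in Python, not the Flex-generated lexer or its C++ adapter: the `Token` tuple, the token kind names, and the regular expression are all assumptions made for illustration and do not reflect the real SystemVerilog lexical rules.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str   # hypothetical token class name
    start: int  # offset of the first character
    end: int    # offset one past the last character


# Toy rules tried in order; the final catch-all guarantees that every
# character of the input belongs to some token.
_RULES = re.compile(
    r"(?P<SPACE>\s+)"
    r"|(?P<COMMENT>//[^\n]*)"
    r"|(?P<WORD>[A-Za-z_]\w*)"
    r"|(?P<OTHER>.)",
    re.DOTALL,
)


def tokenize(text: str) -> List[Token]:
    """Split text into tokens, keeping whitespace and comments."""
    return [Token(m.lastgroup, m.start(), m.end()) for m in _RULES.finditer(text)]


def is_continuous(tokens: List[Token]) -> bool:
    """Continuity: each token ends exactly where the next one starts."""
    return all(a.end == b.start for a, b in zip(tokens, tokens[1:]))


def is_complete(tokens: List[Token], text: str) -> bool:
    """Completeness: first and last tokens span the whole scanned text."""
    if not tokens:
        return text == ""
    return tokens[0].start == 0 and tokens[-1].end == len(text)


def significant(tokens: List[Token]) -> List[Token]:
    """Drop whitespace and comment tokens before parsing."""
    return [t for t in tokens if t.kind not in ("SPACE", "COMMENT")]
```

Because no text is lost between tokens, a consumer that needs the original formatting (for example, to reproduce the source exactly) can read the full stream, while a parser can work on the output of a filter such as `significant`.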

 This lexer follows SystemVerilog lexical definitions, including that of the
 preprocessing sub-language, because it is targeted at _unpreprocessed_ code.

 We provide a
 [standalone tool for examining tokens for any valid SV source file](../tools/syntax).

 ### Token Classifications

 The [token classification functions](verilog_token_classifications.h) logically
 group together sets of token enumerations, so that client code does not have to
 repeat the same logic in different places.
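As a rough illustration of the idea, the sketch below groups token enumerations behind small predicate functions. It is written in Python for brevity; the enumeration members and function names are hypothetical and are not taken from `verilog_token_classifications.h`.

```python
from enum import Enum, auto


class TokenKind(Enum):
    # Hypothetical token enumerations, for illustration only.
    SPACE = auto()
    NEWLINE = auto()
    LINE_COMMENT = auto()
    BLOCK_COMMENT = auto()
    IDENTIFIER = auto()
    PP_DEFINE = auto()
    PP_IFDEF = auto()


_WHITESPACE = frozenset({TokenKind.SPACE, TokenKind.NEWLINE})
_COMMENTS = frozenset({TokenKind.LINE_COMMENT, TokenKind.BLOCK_COMMENT})
_PREPROCESSOR = frozenset({TokenKind.PP_DEFINE, TokenKind.PP_IFDEF})


def is_whitespace(kind: TokenKind) -> bool:
    return kind in _WHITESPACE


def is_comment(kind: TokenKind) -> bool:
    return kind in _COMMENTS


def is_preprocessor_directive(kind: TokenKind) -> bool:
    return kind in _PREPROCESSOR


def is_trivia(kind: TokenKind) -> bool:
    """Tokens that carry no syntactic meaning for the parser."""
    return is_whitespace(kind) or is_comment(kind)
```

Client code then asks `is_trivia(kind)` instead of repeating a list of enumerations, so adding a new comment or whitespace token kind requires a change in only one place.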

 ## Parser

 The parser is generated by [Bison], an LALR(1) parser generator. The generated
 parser implementation is wrapped in an [adapter](verilog_parser.h) that allows
 it to work on tokens from any source, not just the lexer. This gives developers
 the opportunity to insert filtering or transformation passes between the lexer
 and parser.
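The sketch below shows the general shape of this arrangement: a parser that consumes any iterable of tokens, with a filtering pass placed in front of it. It is a conceptual Python illustration under assumed names (`drop_trivia`, `parse_name_list`, and the `(kind, text)` token tuples are hypothetical), not the interface declared in `verilog_parser.h`, and the toy grammar is unrelated to the Bison grammar.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text); a stand-in for a real token type


def drop_trivia(tokens: Iterable[Token]) -> Iterator[Token]:
    """Filtering pass: remove whitespace and comments from the stream."""
    for kind, text in tokens:
        if kind not in ("SPACE", "COMMENT"):
            yield (kind, text)


def parse_name_list(tokens: Iterable[Token]) -> List[str]:
    """Parse the toy grammar  name (',' name)* ';'  from any token source."""
    it = iter(tokens)
    names: List[str] = []
    while True:
        kind, text = next(it, ("EOF", ""))
        if kind != "NAME":
            raise SyntaxError(f"expected a name, got {text!r}")
        names.append(text)
        kind, text = next(it, ("EOF", ""))
        if (kind, text) == ("PUNCT", ";"):
            return names
        if (kind, text) != ("PUNCT", ","):
            raise SyntaxError(f"expected ',' or ';', got {text!r}")
```

The parser never asks where its tokens came from, so the same function accepts a hand-built list, a generator that filters lexer output, or a chain of several passes.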

 The parser outputs a [concrete syntax tree (CST)](../CST) whose generic nodes
 are "typed" using [enumerations](../CST/verilog_nonterminals.h).

 We provide a
 [standalone tool for examining the CST for any valid SV source file](../tools/syntax).

 ## Lexical Context

 Parsing SystemVerilog is fraught with challenges that defy conventional LR
 grammars. [LexicalContext](verilog_lexical_context.h) is a token transformation
 pass that aims to help the Bison-generated parser implementation by:

 1.  Disambiguating tokens that are used in multiple syntactic contexts.
 1.  Performing long-distance lookahead that would otherwise not be expressible
     in an LR grammar.

 It operates like a composition of state-machines that scan and mutate tokens.
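To make the first point concrete: in SystemVerilog, `->` triggers a named event in procedural code and expresses implication inside a constraint block. The sketch below is a minimal Python state machine that relabels `->` based on whether it appears inside a constraint block. It only illustrates the technique; the token kinds, the `CONSTRAINT_IMPLIES` label, and the function name are assumptions, and the logic is far simpler than what `verilog_lexical_context.h` implements.

```python
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, str]  # (kind, text); hypothetical token representation


def disambiguate_arrows(tokens: Iterable[Token]) -> Iterator[Token]:
    """Relabel '->' as CONSTRAINT_IMPLIES when inside a constraint block."""
    seen_constraint = False  # saw 'constraint', waiting for its opening '{'
    depth = 0                # brace nesting depth inside a constraint block
    for kind, text in tokens:
        if kind == "KEYWORD" and text == "constraint":
            seen_constraint = True
        elif kind == "SYMBOL" and text == "{" and (seen_constraint or depth > 0):
            depth += 1
            seen_constraint = False
        elif kind == "SYMBOL" and text == "}" and depth > 0:
            depth -= 1
        elif kind == "SYMBOL" and text == ";" and depth == 0:
            # A declaration without a body ends here; stop waiting for '{'.
            seen_constraint = False
        elif kind == "SYMBOL" and text == "->" and depth > 0:
            kind = "CONSTRAINT_IMPLIES"
        yield (kind, text)
```

The pass changes only the token kind and never the text, so the grammar can use distinct terminals for the two meanings without the parser having to resolve the ambiguity itself.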

 <!-- reference links -->

 [SV-LRM]: https://ieeexplore.ieee.org/document/8299595
 [Bison]: https://www.gnu.org/software/bison/
 [Flex]: https://www.gnu.org/software/flex/