This section describes the various subsystems of the Verilog formatter.
Between every pair of adjacent tokens in the to-be-formatted text, there are a set of attributes that influence spacing and line wrapping decisions, InterTokenInfo, such as minimum spacing between tokens.
The role of the token annotator is to populate InterTokenInfo based on the attributes, types, and syntactic context of adjacent pairs of tokens. Formatting rules around punctuation tokens like parenthesis and brackets often depend on syntactic context.
The main functions that determine inter-token attributes are written as priority-ordered rules, i.e. series of if (condition) return ...;
. There may be an O(N^2) number of inter-token pair combinations, however, we aim to cover as much of that space as possible with few rules. Test cases, however, should cover as much as possible explicitly, to prevent accidental regression.
When running with logging enabled, you will observe that every value or attribute returned by these functions is accompanied with a reason string like “Always insert a space after X”. This not only tells you the value returned but the exact line or rule that triggered it, which is critical for debugging. When fixing issues, don't immediately think of what code to add to solve an issue, consider whether a re-ordering of the rules may work. In some case, code can be simplified by grouping similar case conditions together, or by generalizing rules more broadly.
TokenPartitionTree is a language-agnostic hierarchical representation of (formatting) token ranges. The data represented by each node is an UnwrappedLine, which contains a range of formatting tokens and indentation information. TokenPartitionTree has the following properties:
The role of the TreeUnwrapper is to convert a SystemVerilog concrete syntax tree into a TokenPartitionTree. Once in TokenPartitionTree form, the original syntax structure has been abstracted away and is no longer relevant to the rest of the formatter. If one chose to implement the formatter without relying on a valid syntax tree, one only has to produce a TokenPartitionTree form to reuse everything else that follows. For example, one could just as well produce a TokenPartitionTree directly from a lexical token stream. Such a strategy could better accomodate preprocessing-heavy code which might otherwise be rejected by the parser before producing a syntax tree, at the expense of having to do some pseudo-parsing.
Think of each node in the TokenPartitionTree representing a formatting subproblem: Given a box width constraint, what are the best ways to format a region of text? Leaf partitions of the TokenPartitionTree represents a unit of work that is currently handled by the line-wrapping optimization phase.
The implemention of TreeUnwrapper is split into language-agnostic library functions and Verilog-specific code.
Simplified class inheritance diagram is shown below:
TreeUnwrapper class exploits the Visitor pattern to recursively process the input CST tree:
// SymbolVisitor is an abstract visitor class from which visitors can be derived // SymbolVisitor only visits a single Symbol, and does not recurse on nodes // // Usage: // SymbolPtr tree = ... // SymbolVisitorSubclass visitor; // tree->Accept(visitor); // class SymbolVisitor { public: virtual ~SymbolVisitor() {} virtual void Visit(const SyntaxTreeLeaf& leaf) = 0; virtual void Visit(const SyntaxTreeNode& node) = 0; };
It builds the TokenPartitionTree (i.e. VectorTree<UnwrappedLine>
) with a set of helper methods to simplify the process:
TraverseChildren()
- only traverses the CSTVisitIndentedSection()
- creates indented section and traverses the CSTVisitNewUnwrappedLine()
- begins a new partition of tokens and traverses the CSTThe next stage of code formatting is to apply custom functions that would reshape the partition tree according to the selected policy. An example of that are the kTabularAlignment and the kAppendFittingSubPartitions policies.
Line wrapping optimization is currently done as a Dijsktra shortest-path search over the space of append/wrap decisions spanning a range of formatting tokens. When a wrap (line break) is used, the optimization state incurs a penalty value determined in the token annotation phase. This algorithm is not very scalable, but there are plans to replace it with something like dynamic programming.
This section provides tips about how to triage formatter issues.
verible-verilog-syntax --printtree
shows the concrete syntax tree representation of source code. Many formatting decisions are based on syntax tree node types, which are shown as constants like kPackageDeclaration
.
verible-verilog-format --show_token_partition_tree
shows the TokenPartitionTree subdivision of formatting token ranges and relative indentation. Add --show_inter_token_info
to show even more information about InterTokenInfo before each token.
A lot of other details are traced by running with environment variables VERIBLE_LOGTHRESHOLD=0 VERIBLE_VLOG_DETAIL=8
. The log threshold indicates that we want to see info logs and up (there are four levels: INFO=0, WARNING=1, ERROR=2, FATAL=3) printed to stderr (without that we wouldn't see any vlog outputs as they go to INFO), and vlog detail is logging all verbose logging commands (VLOG()
) up to and including VLOG detail 8; the higher the number, the more details.
(TODO: introduce VERIBLE_VLOG_MODULE
once vmodule available in absl log). Real world TreeUnwrapper example:
echo 'module m; initial a = b + c; endmodule' | \ VERIBLE_LOGTHRESHOLD=0 \ VERIBLE_VLOG_DETAIL=8 \ bazel run //verible/verilog/tools/formatter:verible-verilog-format -- -show_token_partition_tree -
The tabular alignment feature can be summarized as: Given a block of code with syntax-structure similarity, insert spacing needed to achieve vertical alignment among the same substructure elements.
The implementation of tabular alignment for code is split into language-agnostic library functions and Verilog-specific uses of alignment functions.
To trigger alignment for a block of code, the TreeUnwrapper marks a TokenPartitionTree node with kTabularAlignment
: this says that a particular node has some children partitions that could be aligned.
Once a TokenPartitionTree enters the top-level alignment handler, alignment is done in the following phases:
Language-specific: Depending on the syntax tree node type that corresponds to each partition, a function is chosen that is responsible for finding AlignablePartitionGroups, blocks of children partitions that are candidates for aligned formatting. Think of these as groups of rows that constitute a table.
Language-agnostic: Each alignment group analyzes and transforms its rows and columns, essentially calculating the maximum width of each column, and adjusting InterTokenInfo spacing, taking into account left/right flushing.
[N,M,P]
. This makes it easy to accommodate empty cells in each row, and also accommodates a dynamic number of columns. For example, data declarations could have any number of columns based on dimensions. The path-hierarchical nature and lexicographical ordering of keys faciliates dynamic column generation.Should the resulting column widths exceed the column limit (style parameter), then attempting to align a block of code is abandoned, falling back to non-aligned formatting.