\section{Adding Partitioning to the CAD Flow}
\label{sec:CADflow}
Given that the number of signals that can cross the interposer between FPGA dice is considerably more limited than within an FPGA, using a partitioner to divide the circuit into one partition per die is a promising CAD flow.
The overall CAD flow, including the partitioning steps, is shown in Figure \ref{fig:partitioning_cad_flow}.
\begin{figure}[!htbp]
\centering
\includegraphics[height=3.5in]{partitioning_cad_flow.eps}
\caption{Possible ways to add partitioning to the VPR CAD flow.}
\label{fig:partitioning_cad_flow}
\end{figure}
\subsection{Partitioner Stage}\label{sec:partitioner_stage}
As shown in Figure \ref{fig:partitioning_cad_flow}, we explore the effect of circuit partitioning at two different stages of the CAD flow. The first possibility is to partition the netlist of primitive elements (LUTs, FFs, RAM slices, etc.) before packing into the larger function blocks such as logic blocks and (32 kb) RAM blocks. The partitioning constraints must be respected by the packer: it must not pack two primitives assigned to different partitions (dice) into the same function block. We updated the VPR clustering algorithm to respect this new constraint.

The second possibility is to apply partitioning after packing, to the function block (cluster-level) netlist. Each partition is then assigned (randomly, though this is not optimal) to a separate FPGA die. The partitioning constraints must then be respected by the placer: it is free to move function blocks around within a die, but not across interposer cutlines.

These two partitioning options can be combined -- if partitioning is applied before packing, one can also pass the partition information to the placement stage. The post-packing placement constraints can be derived from the primitive block partitioning constraints, with the valid placement region for a given function block being equal to the intersection of the valid placement regions of the primitives packed into that function (clustered) block. There are, therefore, four possible partitioning flows:
\begin{enumerate}[label=(\roman*)]
\item Leaving both the packer and the placer unconstrained. This is the base flow that does not use partitioning, as described in Section \ref{cadSection}.
\item Partitioning the primitive netlist to guide the packer, leaving the placer unconstrained.
\item Leaving the packer unconstrained, and partitioning the function block (clustered) netlist to guide the placer.
\item Partitioning the primitive netlist to guide the packer, and, as described above, passing these constraints to the placer.
\end{enumerate}
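As an illustrative sketch (the function and data structure names are ours, not VPR's), the derivation of post-packing placement constraints used in flow (iv) can be expressed as follows: each function block may only be placed in the intersection of the dice allowed for the primitives packed into it.

```python
# Hypothetical sketch (names are illustrative, not VPR's API): deriving
# post-packing placement constraints from a primitive-level partitioning.
# A clustered (function) block may be placed only in the intersection of
# the dice allowed for its primitives.

def derive_cluster_constraints(primitive_die, clusters):
    """primitive_die: dict primitive -> set of allowed dice.
    clusters: dict cluster -> list of primitives packed into it.
    Returns dict cluster -> set of allowed dice (the intersection)."""
    constraints = {}
    for cluster, prims in clusters.items():
        allowed = None
        for p in prims:
            allowed = set(primitive_die[p]) if allowed is None else allowed & primitive_die[p]
        if not allowed:
            # A legal partitioning-constrained packing never mixes partitions.
            raise ValueError(f"cluster {cluster} mixes primitives from different partitions")
        constraints[cluster] = allowed
    return constraints

# Example: two dice {0, 1}; each cluster inherits its primitives' die.
prim_die = {"lut0": {0}, "ff0": {0}, "lut1": {1}}
clusters = {"clb0": ["lut0", "ff0"], "clb1": ["lut1"]}
print(derive_cluster_constraints(prim_die, clusters))  # {'clb0': {0}, 'clb1': {1}}
```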
\subsection{Partitioning Tool}
Two well-known freely available graph partitioning tools are Metis~\cite{karypis1998multilevelmetis} and hMetis~\cite{karypis1999multilevelhmetis}. Metis operates on graphs, while hMetis operates on hypergraphs -- a hypergraph is a generalization of a graph, where each edge can connect any number of vertices. It is most natural to represent a circuit netlist as a hypergraph, where the vertices are netlist pins (or blocks) and the hyperedges are the nets joining the pins. Owing to the natural hypergraph representation, hMetis would be the most appropriate choice of partitioner. However, another important issue is the resource balancing ability of each tool.
Graph partitioning algorithms assign vertices to partitions while satisfying ``balance constraints''. The balance constraints exist to drive the algorithm away from the trivial partitioning solution where all vertices are in the same partition, leaving all other partitions empty. An example of a basic balance constraint is one which requires an equal number of vertices in each partition. Generally such a perfect balance overly restricts the solution, so some fractional ``unbalance'' is allowed; in this work we use $0.05$ ($5\%$). For two partitions, $P1$ and $P2$, this translates to~\cite{karypismanual}
\begin{equation}\label{eq:balance_constraint}
\frac{|P1|}{|P1 \cup P2|} \in [0.5 - unbalance, 0.5 + unbalance]
\end{equation}
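A minimal sketch of this balance check, assuming two disjoint partitions represented only by their block counts:

```python
# Sketch of the balance check in Eq. (1): for disjoint partitions P1 and P2,
# the fraction of blocks in P1 must lie in [0.5 - unbalance, 0.5 + unbalance].
# We use unbalance = 0.05, as in the text.

def is_balanced(p1_size, p2_size, unbalance=0.05):
    frac = p1_size / (p1_size + p2_size)
    return 0.5 - unbalance <= frac <= 0.5 + unbalance

print(is_balanced(52, 48))  # True: 0.52 lies within [0.45, 0.55]
print(is_balanced(60, 40))  # False: 0.60 exceeds 0.55
```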
\hl{We chose an unbalance of 0.05 to allow the partitioner some flexibility, while still ensuring the partitions are sufficiently equal in size to make good use of each die. Highly unbalanced partitions would leave some dice nearly empty, defeating the key advantage of interposer-based FPGAs: larger capacity.} FPGA circuit netlists may contain different block types such as LUTs, registers, RAM slices, and multipliers, among others. The basic balance constraint of (\ref{eq:balance_constraint}), applied to a netlist with different block types, would ensure that the total block count (over all blocks, regardless of type) in each partition is roughly balanced. The problem with (\ref{eq:balance_constraint}) in the case of multiple block types is that the capacity, per block type, of each partition is not taken into account. It is therefore possible to generate partitions that cannot be realized -- for example, there is nothing preventing the partitioner from moving all LUTs to partition 1 and all multipliers to partition 2, which may not be realizable given the per-block-type capacity of each partition.
One solution is to enforce a balance constraint on each block type, $i$:
\begin{equation}
\forall i, \abs*{ \frac{| P1_i |}{|P1_i \cup P2_i |} - 0.5} < unbalance
\end{equation}
where $P1_i$ is the set of all blocks of type $i$ in partition 1. We refer to this type of per-block-type set of constraints as heterogeneous balance constraints. Note that there are complex legality constraints governing which LUTs and FFs can be packed into legal logic blocks, as well as which multipliers can be packed into legal DSP blocks~\cite{luu2014vtr}. As the partitioner is not aware of these constraints, flow (iii) above (which runs the partitioner after function block packing) will be able to more precisely balance resource use across partitions.
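The heterogeneous balance check can be sketched as follows (illustrative names): the constraint of the equation above is applied once per block type, so LUTs, multipliers, RAM slices, etc. must each be split roughly evenly.

```python
# Sketch of the heterogeneous balance check: the per-type constraint is
# applied to every block type i, not just to the total block count.

def heterogeneous_balanced(p1_by_type, p2_by_type, unbalance=0.05):
    """p1_by_type, p2_by_type: dict block type -> count in that partition."""
    for block_type in set(p1_by_type) | set(p2_by_type):
        n1 = p1_by_type.get(block_type, 0)
        n2 = p2_by_type.get(block_type, 0)
        if n1 + n2 == 0:
            continue
        if abs(n1 / (n1 + n2) - 0.5) >= unbalance:
            return False
    return True

# Total block counts balance (100 vs. 100), but all multipliers sit in
# partition 1, so the heterogeneous constraint rejects this partitioning.
p1 = {"lut": 90, "mult": 10}
p2 = {"lut": 100, "mult": 0}
print(heterogeneous_balanced(p1, p2))  # False
```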
The current version of hMetis does not support heterogeneous balance constraints, though Metis does, so we use Metis in our CAD flow. However, as the circuit netlist is naturally represented as a hypergraph, it needs to be transformed to a graph before Metis can process it. \hl{Our experiments showed that hMetis performed 30\% better on average when compared to Metis on our benchmark circuits with homogeneous balance constraints.}
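Conceptually, Metis expresses heterogeneous balance constraints through multi-constraint partitioning: each vertex carries a vector of weights, one slot per block type, and the partitioner balances each slot independently. The helper below is a hypothetical illustration of building such a weight array (the flat per-vertex layout follows our reading of the Metis manual; the names are ours, not a particular Metis binding):

```python
# Illustrative sketch (not an actual Metis API call): building a flat
# multi-constraint vertex weight array. Each vertex gets one weight slot
# per block type; weight 1 in its own type's slot, 0 elsewhere, so that
# balancing every slot balances every block type.

BLOCK_TYPES = ["lut", "ff", "mult"]  # hypothetical type list

def vertex_weights(block_types_of_vertices):
    """Return a flat array with len(BLOCK_TYPES) weights per vertex."""
    ncon = len(BLOCK_TYPES)
    vwgt = []
    for t in block_types_of_vertices:
        row = [0] * ncon
        row[BLOCK_TYPES.index(t)] = 1
        vwgt.extend(row)
    return vwgt

print(vertex_weights(["lut", "mult"]))  # [1, 0, 0, 0, 0, 1]
```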
\subsection{Hypergraph to Graph Transformation}\label{sec:hypergraph_to_graph}
We transformed the circuit netlist hypergraph to a graph in the following way:
\begin{enumerate}[label=(\roman*)]
\item For each hyperedge of the circuit netlist hypergraph, we generated a graph (the \emph{per-net subgraph}).
\item We combined all of the per-net subgraphs, summing the edge weights for edges appearing in more than one per-net subgraph, to generate the total netlist graph.
\end{enumerate}
We explored several ways to generate the per-net subgraphs, varying the graph topology and the edge weight scheme. We considered two graph topologies, clique and star, as illustrated in Figure \ref{fig:star_clique}. \hl{Ref.\mbox{\cite{hypergraph_to_graph_survey}} surveys several methods of transforming hypergraphs to graphs, and considers the addition of ``dummy'' nodes to the star model. In our star model, we preferred to designate the net source as the center of the star, rather than introduce a dummy node. This choice was motivated by a desire to optimize timing -- the star topology clearly differentiates sources from sinks, and we expect this to guide the partitioner to keep sources and sinks in the same partition.}
We assigned the same edge weight to every edge of the generated subgraph. Edge weights were computed based on the number of vertices in the originating hyperedge, $n$, and we considered edge weights equal to $1$, $1/n$, and $1/n^2$. Our intuition is that assigning lower edge weight to high-fanout nets may be beneficial, and we selected the edge weights accordingly. We refer to the combination of graph topology and edge weight scheme as the \emph{hyperedge model}.
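Steps (i)--(ii) above, together with the three edge weight schemes, can be sketched as follows (a simplified illustration; names are ours):

```python
from collections import defaultdict
from itertools import combinations

# Sketch of the hypergraph-to-graph transformation: each net (hyperedge)
# is expanded into a per-net subgraph (a star rooted at the net's source,
# or a clique over all pins). Every edge of that subgraph gets the same
# weight -- 1, 1/n, or 1/n^2 for a net with n pins -- and parallel edges
# arising from different nets are merged by summing their weights.

def netlist_to_graph(nets, topology="star", weight="1/n"):
    """nets: list of pin lists; the first pin of each net is its source."""
    edges = defaultdict(float)
    for pins in nets:
        n = len(pins)
        if n < 2:
            continue
        w = {"1": 1.0, "1/n": 1.0 / n, "1/n^2": 1.0 / n ** 2}[weight]
        if topology == "star":
            pairs = [(pins[0], sink) for sink in pins[1:]]  # O(n) edges
        else:  # clique: all O(n^2) pin pairs
            pairs = combinations(pins, 2)
        for u, v in pairs:
            edges[tuple(sorted((u, v)))] += w
    return dict(edges)

# A 3-pin net with source "a" under the star 1/n model yields two edges,
# ("a","b") and ("a","c"), each with weight 1/3.
g = netlist_to_graph([["a", "b", "c"]])
```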
\begin{figure}[!htbp]
\centering
\subfigure[]{\includegraphics[width=0.35\linewidth]{star.eps}}
\subfigure[]{\includegraphics[width=0.35\linewidth]{clique.eps}}
\caption{(a) Star and (b) clique graph topologies.}
\label{fig:star_clique}
\end{figure}
Figure \ref{fig:hyperedge_to_graph} shows how a netlist of three blocks and two nets is transformed into its component subgraphs.
\begin{figure}[!htbp]
\centering
\subfigure[]{\includegraphics[width=0.4\linewidth]{hg2g_star.pdf}}
\subfigure[]{\includegraphics[width=0.4\linewidth]{hg2g_clique.pdf}}
\caption{The hypergraph to graph transformation with (a) the star topology, and (b) the clique topology.}
\label{fig:hyperedge_to_graph}
\end{figure}
To select the best hyperedge model, we used the Metis partitioner to generate graph partitions and applied the partitioning result to the original untransformed hypergraph. We used the \emph{hyperedge cutsize} to rank the different hyperedge models, across the same circuits as in Section \ref{sec:CADeffect}. The hyperedge cutsize is the number of hyperedges (nets) crossing the cutline, as illustrated in Figure \ref{fig:hyperedge_cutline}, and it serves as a proxy for circuit routability. The results are shown in Figure \ref{fig:graph_topology_cutsize}.
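The cutsize computation on the untransformed hypergraph can be sketched as follows (illustrative names): a net is cut exactly when its pins span more than one partition.

```python
# Sketch of the hyperedge cutsize metric: after partitioning, count the
# original nets (hyperedges) whose pins land in more than one partition.

def hyperedge_cutsize(nets, partition_of):
    """nets: list of pin lists; partition_of: dict pin -> partition id."""
    return sum(1 for pins in nets if len({partition_of[p] for p in pins}) > 1)

# Six vertices, three hyperedges, two partitions: the nets {c, d} and
# {b, e, f} cross the cutline, while {a, b} does not.
nets = [["a", "b"], ["c", "d"], ["b", "e", "f"]]
part = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(hyperedge_cutsize(nets, part))  # 2
```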
\begin{figure}[!htbp]
\centering
\includegraphics[width=0.75\linewidth]{hyperedge_cutline.pdf}
\caption{A hypergraph with six vertices, three hyperedges, and two partitions. As there are two hyperedges crossing the cutline, the hyperedge cutsize is 2.}
\label{fig:hyperedge_cutline}
\end{figure}
\begin{figure}[!htbp]
\centering
%\includegraphics[width=\linewidth]{graph_topology_cutsize.eps}
%\input{graph_topology_cutsize.tex}
\input{graph_topology_cutsize}
\caption{Hypergraph cut size achieved by Metis with different hyperedge models and weightings. $n$ is the number of vertices on a hyperedge.}
\label{fig:graph_topology_cutsize}
\end{figure}
The performance of the clique topology is significantly worse than that of the star topology, regardless of edge weight scheme. An intuitive explanation for this result is that with the clique topology, the partitioner does not have knowledge of which vertex is the source of a net and which vertices are sinks. In contrast, the star topology clearly differentiates sources from sinks and this appears to give the \hl{iterative-refinement-based} partitioner an ``anchor point'' that pulls all hyperedge fanouts toward the source. Additionally, if $n$ is the average number of vertices in a hyperedge, the clique topology generates $O(n^2)$ edges while the star topology generates only $O(n)$ edges, so the star topology saves both memory and runtime.
\hl{For both topologies, the $1/n$ edge weight scheme gives the smallest cut size on average. This can be explained intuitively in terms of the total edge weight assigned to each net. The star $1$ (constant) edge weight scheme assigns a total weight of $n - 1$ to each net, which heavily weights high-fanout nets. In contrast, the star $1/n$ model assigns a total weight of $1 - \frac{1}{n}$ to each net, which is nearly constant across fanouts. The star $1/n^2$ model assigns a total weight of only $\frac{n-1}{n^2} \approx \frac{1}{n}$, penalizing high-fanout nets relative to lower-fanout nets. Since we seek to minimize the hyperedge cut, it makes intuitive sense to weight all hyperedges equally, and the star $1/n$ model best achieves this.}
To validate the choice of hyperedge cutsize as a proxy for circuit routability, we ran the partitioning CAD flow for each hyperedge model, across several circuits. We use an unbalance of $5\%$, split the clustered netlist into two partitions (constraining only the placer), cut $80\%$ of the wires crossing the interposer, and impose a 1~ns delay penalty for wires crossing the interposer. To accommodate unbalance in the placement engine, we increase the size of the grid of complex blocks that make up the FPGA device. Relative to the minimum device size required for placement (without partitioning constraints), we add $10\%$ to the complex block grid width and $10\%$ to the grid height. Figure \ref{fig:graph_topology_mcw} shows the geometric mean of the minimum channel width required for a successful route, for each hyperedge model.
\begin{figure}[!htbp]
\centering
\input{graph_topology_mcw}
%\includegraphics[width=\linewidth]{graph_topology_mcw.eps}
\caption{Minimum channel width achieved by VPR with different hyperedge models and weightings. $n$ is the number of vertices on a hyperedge.}
\label{fig:graph_topology_mcw}
\end{figure}
The star topology with $1/n$ edge weights achieves the best minimum channel width, confirming that hyperedge cutsize is a good proxy for routability. Consequently, we use this model in all future results in this paper. Unlike routability, we found that the post-routing critical path delay was not strongly impacted by the hypergraph model used.
\subsection{Partitioner Stage Results}\label{sec:partitioner_stage_results}
We compare the performance of the four CAD flow variations described in Section \ref{sec:partitioner_stage} on the 8 largest VTR benchmarks. We again use an unbalance of $5\%$ and split the clustered netlist into two partitions (constraining the packer and/or the placer, as required by the flow under test). The interposer parameters (\% wires cut = $80\%$ and delay increase = 1 ns) and bloat factor ($10\%$ in each dimension) are the same as in Section \ref{sec:hypergraph_to_graph}. Figure \ref{fig:flows_mcw} shows the minimum routable channel width for each partitioning CAD flow variation.
\begin{figure}[!htbp]
\centering
%\includegraphics[width=\linewidth]{vpr_flows_mcw.eps}
\input{vpr_flows_mcw}
\caption{Minimum channel width achieved by VPR for each partitioning flow.}
\label{fig:flows_mcw}
\end{figure}
On average, the best minimum channel width was obtained when constraining both packing and placement. However, constraining packing alone performed significantly better than constraining placement alone. This shows that the VPR packer strongly benefits from partitioning constraints when targeting an interposer-based architecture, likely because they forbid the packing of primitives that are not naturally related into one function block. Interestingly, other recent work targeting conventional FPGAs has shown that using a partitioner to guide packing benefits routability~\cite{feng2014rent}. On the other hand, the VPR placement algorithm with the enhancements described in Section \ref{cad_enh_placer_subsection} sees only a modest routability benefit from the partitioning constraints.
For the same set of circuits, we computed the critical path delay at a channel width equal to 1.3x the minimum. The results are shown in Figure \ref{fig:flows_crit_path}.
\begin{figure}[!htbp]
\centering
\input{vpr_flows_crit_path}
\caption{Critical path delay achieved by VPR for each partitioning flow.}
\label{fig:flows_crit_path}
\end{figure}
On average, the critical path delay did not vary significantly between the different partitioning CAD flows, but it appears that partitioning before packing may slightly improve circuit speed, while enforcing these partitioning constraints during placement may slightly reduce it.
Considering both routability and delay, the best CAD flow used the partitioning result to constrain both packing and placement, and we use that flow in the remainder of this paper.
\subsection{\hl{Interposer Wire Utilization}}
\hl{Table \mbox{\ref{table:interposer_wire_utilization}} shows the number of connections on the critical path and how many of those critical connections cross the interposer boundary. It also shows the number of cut nets per vertical channel, which provides a lower bound on the number of wires required in an interposer channel. The data was obtained at 70\% wires cut.}
\begin{table}[!htbp]
\centering
\caption{Interposer wire utilization}
\begin{tabular}{|l|r|r||r|r|}
\hline
& \multicolumn{2}{c||}{Critical Connections} & \multicolumn{2}{c|}{Cut Nets} \\
& Cut & Total & Total & Per Channel \\ \hline
bgm & 1 & 373 & 622 & 8.89 \\ \hline
stereovision0 & 3 & 57 & 985 & 25 \\ \hline
stereovision1 & 1 & 21 & 876 & 21 \\ \hline
LU8PEEng & 0 & 1661 & 1795 & 20 \\ \hline
mkDelayWorker32B & 4 & 58 & 771 & 17 \\ \hline
mcml & 5 & 1443 & 2891 & 28 \\ \hline
LU32PEEng & 0 & 1664 & 1130 & 5 \\ \hline
\end{tabular}
\label{table:interposer_wire_utilization}
\end{table}
\hl{Though our choice of the star topology benefits timing, our partitioning flow is not timing-driven. We would expect a timing-driven partitioner to reduce the number of cut critical connections, but obtaining good critical path estimates at the partitioning stage is nontrivial\mbox{\cite{hutton}}.}
\subsection{\% Wires Cut}
To evaluate the CAD flow performance for a range of interposer architectures, we varied the fraction of wires cut and measured the average minimum channel width over the 8 largest VTR circuits. As shown in Figure \ref{fig:wires_cut}, this new flow is a significant improvement over that of~\cite{interposer2014} for FPGAs where more than half the wires are cut at interposer boundaries. \hl{However, it is slightly worse than the traditional FPGA CAD flow for an architecture with 0\% of the wires cut}, which is a monolithic FPGA, presumably because we have restricted the placer's freedom.
\begin{figure}[!htbp]
\centering
\input{wires_cut}
\caption{Impact of partitioning-based CAD flow on routability for a range of architectures.\protect\footnotemark}
\label{fig:wires_cut}
\end{figure}
\footnotetext{The best flow from~\cite{interposer2014} does not provide data at 90\% wires cut.}
\subsection{Bloating of the Clustered Netlist due to Partitioning Constraints}
The additional constraints on the packer imposed by partitioning may result in a less dense packing. Figure \ref{fig:packing_bloat} compares the number of complex blocks in the partitioning-constrained and unconstrained clustered netlists.
\begin{figure}[!htbp]
\centering
\input{packing_bloat}
\caption{The number of blocks in the clustered netlist with and without partitioning constraints. This is a measure of packing bloat.}
\label{fig:packing_bloat}
\end{figure}
On average, the packing-constrained netlist contained $3.2\%$ more complex blocks than the unconstrained netlist. The LU8PEEng and LU32PEEng circuits were the most affected by this packing bloat. In those circuits, we found that the packing bloat was due almost entirely to poor packing of memory blocks; an interesting area for future work would be to augment the partitioner to understand that some RAM primitives are best kept together in one partition.