5/17/2020

Chip Placement with Deep Reinforcement Learning

Azalia Mirhoseini, et al., Chip Placement with Deep Reinforcement Learning, arXiv:2004.10746.
In this work, we present a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously unseen chip blocks. To achieve these results, we pose placement as a Reinforcement Learning (RL) problem and train an agent to place the nodes of a chip netlist onto a chip canvas. To enable our RL policy to generalize to unseen blocks, we ground representation learning in the supervised task of predicting placement quality. By designing a neural architecture that can accurately predict reward across a wide variety of netlists and their placements, we are able to generate rich feature embeddings of the input netlists. We then use this architecture as the encoder of our policy and value networks to enable transfer learning. Our objective is to minimize PPA (power, performance, and area), and we show that, in under 6 hours, our method can generate placements that are superhuman or comparable on modern accelerator netlists, whereas existing baselines require human experts in the loop and take several weeks.
Framework
RL problems can be formulated as Markov Decision Processes (MDPs), consisting of four key elements (a minimal code sketch of this placement MDP follows the list):
  • states: the set of possible states of the world (e.g., in our case, every possible partial placement of the netlist onto the chip canvas).
  • actions: the set of actions that can be taken by the agent (e.g., given the current macro to place, the available actions are the set of all the locations in the discrete canvas space (grid cells) onto which that macro can be placed without violating any hard constraints on density or blockages).
  • state transition: given a state and an action, this is the probability distribution over next states. 
  • reward: the reward for taking an action in a state. (e.g., in our case, the reward is 0 for all actions except the last action where the reward is a negative weighted sum of proxy wirelength and congestion, subject to density constraints as described in Section 3.3).
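To make the MDP concrete, here is a minimal, hypothetical Python sketch of such a placement environment. The grid size, the density check (a simple occupancy test), and the proxy-cost callback are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the placement MDP described above.
# Grid size, density handling, and the proxy-cost callback are
# illustrative assumptions, not the paper's implementation.
import numpy as np

class PlacementEnv:
    """Toy placement MDP: states are partial placements, actions are free
    grid cells, and only the final action receives a nonzero reward."""

    def __init__(self, num_macros, proxy_cost_fn, grid=(32, 32)):
        self.num_macros = num_macros
        self.proxy_cost_fn = proxy_cost_fn  # maps placements -> scalar proxy cost
        self.grid = grid
        self.reset()

    def reset(self):
        self.occupied = np.zeros(self.grid, dtype=bool)  # partial placement state
        self.placements = []                             # (row, col) per placed macro
        return self.occupied.copy(), len(self.placements)

    def valid_actions(self):
        # All grid cells not yet occupied (stand-in for the density/blockage checks).
        return np.argwhere(~self.occupied)

    def step(self, cell):
        row, col = cell
        assert not self.occupied[row, col], "hard constraint violated"
        self.occupied[row, col] = True
        self.placements.append((row, col))
        done = len(self.placements) == self.num_macros
        # Reward is 0 until the last macro is placed, then the negative
        # weighted sum of the wirelength and congestion proxies.
        reward = -self.proxy_cost_fn(self.placements) if done else 0.0
        return (self.occupied.copy(), len(self.placements)), reward, done
```

For example, `PlacementEnv(num_macros=5, proxy_cost_fn=lambda p: 0.0)` runs with a trivial proxy; in the paper's setup the proxy combines wirelength and congestion, as described in the next section.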
Reward (equation 4)
Our goal in this work is to minimize PPA (power, performance, and area), subject to constraints on routing congestion and density. The true reward is the output of a commercial EDA tool, which accounts for wirelength, routing congestion, density, power, timing, and area. However, RL policies require hundreds of thousands of examples to learn effectively, so it is critical that the reward function be fast to evaluate, ideally running in a few milliseconds. To be effective, such approximate reward functions must also be positively correlated with the true reward. Therefore, one component of the cost is wirelength, because it is not only much cheaper to evaluate than the full EDA flow, but also correlates with power and performance (timing).
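Written out in the form this description suggests (a reconstruction of what the heading calls equation 4, where p is a placement, g the netlist graph, and λ weights congestion against wirelength):

$$R_{p,g} = -\mathrm{Wirelength}(p, g) - \lambda \cdot \mathrm{Congestion}(p, g), \qquad \text{subject to } \mathrm{density}(p, g) \le \mathrm{max\_density}$$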
Anna Goldie and Azalia Mirhoseini, Chip Design with Deep Reinforcement Learning, Google blog, April 23, 2020.
The input to our model is the chip netlist (node types and graph adjacency information), the ID of the current node to be placed, and some netlist metadata, such as the total number of wires, macros, and standard cell clusters. The netlist graph and the current node are passed through an edge-based graph neural network that we developed to encode the input state. This generates embeddings of the partially placed graph and the candidate node. 
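As a rough illustration of what an edge-based message-passing encoder of this kind might look like, here is a small numpy sketch. The layer sizes, number of iterations, and random weights are assumptions for illustration, and the function names (`edge_gnn_encode`, `relu`) are hypothetical, not the paper's code.

```python
# Hypothetical numpy sketch of an edge-based message-passing encoder in the
# spirit of the description above; sizes and iteration count are assumptions.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def edge_gnn_encode(node_feats, edges, hidden=32, iterations=3, seed=0):
    """node_feats: (num_nodes, f) array; edges: list of (src, dst) index pairs.
    Returns per-edge embeddings, per-node embeddings, and a pooled graph embedding."""
    rng = np.random.default_rng(seed)
    num_nodes, f = node_feats.shape
    w_node = rng.normal(scale=0.1, size=(f, hidden))           # node projection
    w_edge = rng.normal(scale=0.1, size=(2 * hidden, hidden))  # edge update

    h = relu(node_feats @ w_node)                # initial node embeddings
    for _ in range(iterations):
        # Edge embedding: transform the concatenated endpoint embeddings.
        src = [s for s, _ in edges]
        dst = [d for _, d in edges]
        e = relu(np.concatenate([h[src], h[dst]], axis=1) @ w_edge)
        # Node embedding: mean of the embeddings of incident edges.
        h_new = np.zeros((num_nodes, hidden))
        counts = np.zeros((num_nodes, 1))
        for k, (s, d) in enumerate(edges):
            h_new[s] += e[k]; h_new[d] += e[k]
            counts[s] += 1;   counts[d] += 1
        h = h_new / np.maximum(counts, 1)
    graph_embedding = e.mean(axis=0)             # pooled over all edges
    return e, h, graph_embedding
```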

The edge, macro and netlist metadata embeddings are then concatenated to form a single state embedding, which is passed to a feedforward neural network. The output of the feedforward network is a learned representation that captures the useful features and serves as input to the policy and value networks. The policy network generates a probability distribution over all possible grid cells onto which the current node could be placed. 
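A minimal sketch of that data flow follows: the concatenated state embedding feeds a shared feedforward trunk, whose output drives a masked distribution over grid cells and a scalar value estimate. The layer sizes, the flat feedforward policy head, and the masking scheme are assumptions for illustration; the paper's actual head architecture is not reproduced here.

```python
# Hypothetical sketch of the policy/value heads: state embedding -> shared
# trunk -> (masked distribution over grid cells, scalar value). Sizes and
# the flat feedforward head are illustrative assumptions.
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def policy_value_heads(graph_emb, node_emb, metadata_emb, valid_mask,
                       grid=(32, 32), hidden=64, seed=0):
    """valid_mask: boolean (rows, cols) array of cells that satisfy the
    hard density/blockage constraints."""
    rng = np.random.default_rng(seed)
    state = np.concatenate([graph_emb, node_emb, metadata_emb])  # state embedding

    w1 = rng.normal(scale=0.1, size=(state.size, hidden))
    w_pi = rng.normal(scale=0.1, size=(hidden, grid[0] * grid[1]))
    w_v = rng.normal(scale=0.1, size=(hidden, 1))

    trunk = np.tanh(state @ w1)              # shared feedforward trunk
    logits = trunk @ w_pi
    logits[~valid_mask.ravel()] = -1e9       # mask infeasible grid cells
    policy = softmax(logits).reshape(grid)   # distribution over grid cells
    value = (trunk @ w_v).item()             # scalar value estimate
    return policy, value
```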
In each iteration of training, the macros are sequentially placed by the RL agent, after which the standard cell clusters are placed by a force-directed method, which models the circuit as a system of springs to minimize wirelength. RL training is guided by a fast-but-approximate reward signal calculated for each of the agent’s chip placements using the weighted average of approximate wirelength (i.e., the half-perimeter wirelength, HPWL) and approximate congestion (the fraction of routing resources consumed by the placed netlist).
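For reference, a small Python sketch of the HPWL proxy and the resulting reward signal. The congestion estimate and its weight are left as inputs, since computing routing congestion is outside the scope of this sketch; the weighting is an assumption consistent with the description above.

```python
# Hypothetical sketch of the half-perimeter wirelength (HPWL) proxy: for each
# net, the half-perimeter of the bounding box of its pins, summed over nets.
def hpwl(nets, positions):
    """nets: list of lists of node ids; positions: dict node id -> (x, y)."""
    total = 0.0
    for net in nets:
        xs = [positions[n][0] for n in net]
        ys = [positions[n][1] for n in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def proxy_reward(nets, positions, congestion, congestion_weight=0.5):
    # Negative weighted sum of proxy wirelength and approximate congestion.
    return -(hpwl(nets, positions) + congestion_weight * congestion)
```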

Result
We compare our method with the state-of-the-art RePlAce placer (Cheng et al., 2019) and with manual placements by human experts, evaluated using an industry-standard electronic design automation (EDA) tool. For all metrics in the comparison, lower is better.
