# betterv_controlled_verilog_generation_with_discriminative_guidance__6f915c44.pdf Better V: Controlled Verilog Generation with Discriminative Guidance Zehua Pei 1 Hui-Ling Zhen 2 Mingxuan Yuan 2 Yu Huang 2 Bei Yu 1 Due to the growing complexity of modern Integrated Circuits (ICs), there is a need for automated circuit design methods. Recent years have seen increasing research in hardware design language generation to facilitate the design process. In this work, we propose a Verilog generation framework, Better V, which fine-tunes large language models (LLMs) on processed domainspecific datasets and incorporates generative discriminators for guidance on particular design demands. Verilog modules are collected, filtered, and processed from the internet to form a clean and abundant dataset. Instruct-tuning methods are specially designed to fine-tune the LLMs to understand knowledge about Verilog. Furthermore, data are augmented to enrich the training set and are also used to train a generative discriminator on particular downstream tasks, providing guidance for the LLMs to optimize Verilog implementation. Better V has the ability to generate syntactically and functionally correct Verilog, outperforming GPT-4 on the Verilog Eval benchmark. With the help of task-specific generative discriminators, Better V achieves remarkable improvements on various electronic design automation (EDA) downstream tasks, including netlist node reduction for synthesis and verification runtime reduction with Boolean Satisfiability (SAT) solving. 1. Introduction Large language models (LLMs) are leading the world due to their strong capability in generating and understanding natural language at a massive scale, which makes their potential applications and benefits vast across various domains and tasks. In the field of coding, LLMs can act as seasoned assistants for developers by providing advice on programming, 1The Chinese University of Hong Kong, Hong Kong SAR 2Noah s Ark Lab, Huawei, Hong Kong SAR. Correspondence to: Bei Yu . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). finding and fixing bugs in code, and even generating entire codebases from descriptions (Chen et al., 2021; Nijkamp et al., 2023). Electronic design automation (EDA) is a set of software and services for designing integrated circuits (ICs), which work together in the design flow to conceive and analyze circuit designs. The slowing down of Moore s Law puts increasing pressure on EDA, creating an emergent need to further improve and automate the design flow. Therefore, it is expected that LLMs will be incorporated into the EDA flow. LLMs in EDA have already achieved remarkable success in various tasks (He et al., 2023; Liu et al., 2023a). Hardware design languages (HDLs), such as Verilog and VHDL, describe the hardware design at the very beginning of the design flow, playing an important role in the EDA flow and strongly influencing the subsequent stages. However, writing HDL is time-consuming and prone to bugs, making it more expensive for today s complex ICs. Therefore, it is promising to utilize LLMs to automatically generate the desired HDL. Recently, several works have focused on Verilog generation (Dehaerne et al., 2023; Thakur et al., 2023). However, these works primarily refine LLMs with selected datasets, neglecting the direct integration of syntactic or functional correctness in the model s development. Moreover, they insufficiently address pivotal downstream tasks in the EDA process, which should be crucial evaluations for the generated Verilog. Despite the expectation that LLMs can play an important role in Verilog generation, several challenges need to be addressed. Firstly, the complex and strict requirements of hardware designs restrain LLMs from learning and understanding the knowledge related to Verilog. Secondly, there are limited Verilog resources available globally, which often leads to problems of overfitting and data bias during LLM fine-tuning. Moreover, considering the complicated and varied downstream tasks further complicates the issue. In industrial production, optimizing downstream targets from Verilog design to physical implementation involves numerous iterations and cannot be accomplished in a single step. This makes it difficult for LLMs to anticipate future stages and impractical to fine-tune the LLMs for each downstream task. Controllable text generation involves techniques that train extra discriminators to guide LLMs in desired directions (Scialom et al., 2020). However, these works primarily Better V: Controlled Verilog Generation with Discriminative Guidance focus on controlling natural language text representation during their development. In this work, we explore the potential of using this technique for optimization tasks, which is more challenging and fundamentally different from merely handling text representation. In this work, we propose a framework, Better V, for Verilog generation by instruct-tuning LLMs on our processed datasets and incorporating generative discriminators to optimize Verilog implementation for various downstream tasks. We utilize the alignment between Verilog and C programs to help LLMs effectively understand Verilog. Additionally, we propose data augmentation to address the issue of limited Verilog data. Furthermore, we recognize that various Verilog implementations can differ significantly in terms of Power, Performance, and Area (PPA) or verification runtime. Therefore, we transform the problems in downstream tasks into optimization challenges for Verilog implementation. The generative discriminator guides the LLMs to generate or modify Verilog implementations directly from natural language, making them as efficient as possible for downstream tasks and effectively reducing the number of iterations in industrial production. With these techniques, Better V successfully imparts domain-specific knowledge to LLMs and can adapt to any Verilog-related downstream tasks. The contributions of this paper are summarized as follows: Better V represents a groundbreaking development as the first endeavor to apply controllable text generation to engineering optimization challenges, specifically in optimizing downstream tasks in Electronic Design Automation (EDA). This approach not only introduces an innovative and promising research trajectory in EDA but also holds potential for application in optimization issues across various other domains. Better V marks a pioneering advancement as the first downstream task-driven method for Verilog generation. Our experiments employing various generative discriminators on specific Electronic Design Automation (EDA) tasks have demonstrated notable effectiveness. This innovative approach is characterized by its task-specific discriminator guidance, enhancing both its training efficiency and practical utility. Utilizing fine-tuned 6.7B/7B-parameter LLMs, without the application of prompt-engineering strategies, Better V demonstrates the capacity to generate syntactically and functionally correct Verilog, which surpasses GPT-4 when evaluated on the Verilog Eval benchmark. Better V offers a versatile solution for data augmentation, tailored to meet diverse specifications in Verilog implementation. This approach addresses the issue posed by the scarcity of Verilog resources effectively. 2. Related Works In this section, we briefly introduce some advancements and applications of LLMs for Verilog generation in Section 2.1. We also discuss the development of discriminator-guided controllable generation in Section 2.2. 2.1. LLMs for Verilog Generation Large language models (LLMs) have shown remarkable performance on code generation with either training a model from the beginning or fine-tuning a pre-trained model (Nijkamp et al., 2023; Roziere et al., 2023). The success of LLMs on code generation arouses the interest of study of LLMs on hardware design. As a widely recognized hardware description language (HDL), the generation of Verilog using LLMs has garnered significant attention and undergone extensive exploration. Some studies pay attention to construct customized datasets for the fine-tuning of LLMs, such as (Thakur et al., 2023) and (Dehaerne et al., 2023), who collect Verilog from the internet and process the data prior to training. Verilog Eval (Liu et al., 2023b) and RTLCoder (Liu et al., 2023c) further study the importance of problem descriptions and then generate various problemanswer pairs as dataset. RTLCoder also considers the quality feedback on different data by ranking them with scores during the training. Researchers also try utilizing the promptengineering techniques to enhance the generation ability, such as the self-planning used in RTLLM (Lu et al., 2023). Verilog Eval (Liu et al., 2023b) and RTLLM (Lu et al., 2023) build benchmarks to evaluate the generated Verilog on their functional or syntactic correctness. 2.2. Discriminator-guided Controllable Generation Controlling LLMs has been widely explored during recent years. Class-conditional language models (CCLMs), such as CTRL (Keskar et al., 2019), aim at controlling the generation by appending a control code in the beginning of training sequences. Discriminator-guided controllable generation is an important technique used for controllable generation, which combines a discriminator with the generative LLMs. The constraints are modeled by calculating the conditional probability on each class and related to the next-token probabilities by Bayes rule. (Holtzman et al., 2018) and (Scialom et al., 2020) predict the labels by feeding each candidate next token into a discriminator and hence guiding the generation to desired directions. PPLM (Dathathri et al., 2020) further employs a forward and backward process with the gradient from a discriminator to update the latent states of LLMs to guide the target generation. In order to reduce the cost of using discriminator for each possible next tokens, Ge Di (Krause et al., 2020) contrastively trains the CCLMs as generative discriminators to guide the generation during the decoding. DEXPERTS (Liu et al., 2021) attempts to further improve the performance by incorporating expert and anti-expert during the decoding. Better V: Controlled Verilog Generation with Discriminative Guidance 3. Algorithm In this section, we describe the framework of Better V, starting with an overview and then detailing the instruct-tuning process and the implementation of the generative discriminator. Pre-trained Code Knowledge Transfer Verilog Autocompletion Desired Verilog Undesired Verilog EDA-Tools Data-processing Instruct-tuning Discriminator training Generative Discriminator Data Augmentation Filtering Open-source Definition-Body Figure 1. The overview of Better V. 3.1. Framework Overview Better V conducts instruct-tuning for domain-specific understanding and combines generative discriminators to guide different EDA downstream tasks. The framework overview is demonstrated in Figure 1. Our approach first constructs a customized dataset from open-source Verilog, as described in Section 3.2, before the instruct-tuning stage. During instruct-tuning, various instruction problems are employed to teach the LLMs domain-specific knowledge, as detailed in Section 3.3. After that, we implement data augmentation to further enrich the dataset and prepare the labels for training the generative discriminator, as explained in Section 3.4. Finally, in Section 3.5, the generative discriminator is trained using a hybrid generative-discriminative loss and then guides the LLMs through Bayes rule. In the rest of this section, we present the details of Better V in guiding the LLMs for downstream EDA tasks optimization. 3.2. Instruct-Tuning Data-Processing Inspired by (Dehaerne et al., 2023), we collect open-source repositories on Git Hub that contain Verilog or System Verilog. At the same time, we check the repository licenses to ensure they permit modification and distribution. We filter out auto-generated files, which are mostly repetitive, and non-permissive files that contradict the licenses. As in (Dehaerne et al., 2023), we filter files with too many or too few lines. These files are then processed by extracting the Verilog modules and functions using regular expressions. We further analyze the extracted contents by measuring the token number after encoding them with the tokenizer and removing contents that exceed a predefined maximum token number. This is because contents exceeding the maximum token number will be truncated by the tokenizer during training, preventing the provision of meaningful information with incomplete contents. Finally, we filter out Verilog modules that have syntactic errors. To further support our instruct-tuning tasks, we implement two types of processing on the collected contents. First, we use a V2C tool to translate Verilog into C (Mukherjee et al., 2016), thereby transferring the hardware design information represented in Verilog into the C program. The dataset is then appended with the translated C programs that can be derived from the Verilog and do not exceed the maximum token number. Second, we split the Verilog modules and functions into their definitions (including the module header and input and output definitions) and their bodies. Finally, our constructed dataset contains a set of Verilog-C pairs and a set of Verilog definition-body pairs. 3.3. Domain-specific Instruct-Tuning The fine-tuning of LLMs is crucial because it determines how well the domain-specific knowledge is learned by the LLMs and how familiar the LLMs become with the customized tasks. This is particularly important for the task of Verilog generation since the existing corpus of Verilog is much smaller than that of general programming languages such as C and Python. The lack of Verilog corpus not only means that pre-trained LLMs learn less about it but also increases the importance of transferring more domainspecific knowledge through fine-tuning. Previous works implement fine-tuning by first generating problem descriptions with LLMs and then using the LLMs to generate the corresponding answers, i.e., the Verilog modules. However, such machine-generated descriptions are often verbose and fallible, failing to transfer useful and reliable knowledge during fine-tuning. Additionally, since this generation is done by LLM inference, it results in time consumption and resource occupancy, significantly increasing the implementation cost. To solve this issue, we propose domain-specific instructtuning, with examples shown in Figure 2. Firstly, we introduce a novel alignment method that maps Verilog code to C, facilitating the LLM s understanding of Verilog. We leverage the abundant knowledge that LLMs already have about general programming languages, such as C. We then construct instructions that ask the LLMs to translate Verilog into C and also translate from C to Verilog, allowing the LLMs to learn to understand Verilog through the implementation of C. With the help of the V2C tool, the dataset is easy to obtain, as described in the last section. Additionally, to improve the performance of LLMs in following Verilog generation instructions, we construct instructions for Verilog autocompletion, where the module definition is provided in the instruction and the completed module is the answer. It should be noted that we do not add extra natural language descriptions in instruct-tuning since the implementation of the C program already acts as a functional description to teach the LLMs. It should be noted that the LLMs follow an auto-regressive Better V: Controlled Verilog Generation with Discriminative Guidance System Prompt: You are an experienced Verilog engineer. Instruction: You are supposed to translate the following Verilog into C program. module d_latch( input d, input en, input rstn, output reg q ); always @ (en or rstn or d) if (!rstn) q <= 0; else if (en) q <= d; endmodule #include #include void d_latch( int d, bool en, bool rstn, int *q) { if (!rstn) { *q = 0; } else if (en) { *q = d; } } Instruction: Below is a definition of a Verilog module. And you are supposed to complete the Verilog program. module d_latch( input d, input en, input rstn, output reg q ); module d_latch( input d, input en, input rstn, output reg q ); always @ (en or rstn or d) if (!rstn) q <= 0; else if (en) q <= d; endmodule Figure 2. The examples for instruct-tuning. We take a simple d latch Verilog module as example. The left and right part indicate the code knowledge transfer and Verilog autocompletion, respectively. manner, where an LLM with parameters θ predicts the probability of a sequence x = {x1, ..., xn}, which includes both the instruction and the answer, and factorizes it using the chain rule of probability as follows (Keskar et al., 2019): t=1 pθ(xt|x τ are maintained and form the set Vτ. Therefore, the tokens that we keep before the sampling of generation are given by Vk = Vτ Vm. With this generative discriminator training pipeline and the equipped downstream task-specific augmented data, we can model any optimization scenario related to Verilog implementation and train a distinct discriminator to guide towards our desired directions. At the same time, the loops in industrial production and the manual cost can be effectively reduced by jointly considering the corresponding downstream constraints during the phase of Verilog generation. 4. Experiments In this section, we first introduce the experimental setting as well as the evaluation metrics in Section 4.1. Then in Section 4.2 we present our results on functional correctness and compared with other methods, followed by further results with different downstream tasks incorporated with the discriminator (including Synthesis Nodes Reduction in Section 4.3.1 Verification Runtime Reduction in Section 4.3.2) and ablation study in Section 4.4. 4.1. Experimental Setting We fine-tune the Code Llama-7B-Instruct (Roziere et al., 2023) and Deep Seek-Coder-6.7b-Instruct (Guo et al., 2024) Better V: Controlled Verilog Generation with Discriminative Guidance as generative LLMs, and fine-tune Tiny Llama (Zhang et al., 2024) and Deep Seek-Coder-1.3b-Instruct (Guo et al., 2024) as generative discriminator, respectively. For Code Qwen1.57B-Chat (Bai et al., 2023), we can only fine-tune it since there is no suitable smaller size version of it. Better V is trained with the help of Deep Speed Ze RO (Rajbhandari et al., 2020) and Lo RA (Hu et al., 2021) on the Deep Speed Chat (Yao et al., 2023). The experiments are conducted on a machine with two NVIDIA Tesla V100S PCIe 32 GB graphics cards with CUDA driver 11.4. We employ the Verilog Eval (Liu et al., 2023b), which comprises various problems with either machine-generated or human-crafted, as our evaluation benchmark. Following Verilog Eval, we measure the Verilog functional correctness with simulation through the pass@k metric with unbiased estimator: pass@k := E problems " 1 n c k n k where n k samples are generated per problem and a problem is solved if any of the k samples passes the unit tests. Due to the Plug-and-Play nature of the discriminator, Better V can work on various downstream tasks. For other downstream tasks besides the functional correctness, we can still utilize the pass@k metric and the problems in Verilog Eval but the correctness is measured by different tools in Yosys (Wolf) according to different downstream task. In our experiments, we sample n = 20 code completions per problem for each downstream task and measuring pass@k with k = 1,5,10. For all the models, we employ the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9 and β2 = 0.95 and the cosine learning rate decay (Loshchilov & Hutter, 2016) to schedule our learning rate. For the generative LLMs fine-tuning process, we train it for 4 epochs using an initial learning rate of 9.65e-6 with a batch size of 4. For the generative discriminator, we train it for 3 epochs using an initial learning rate of 9.65e-6 with a batch size of 8. The Lo RA dimension for both LLMs and discriminator is set as 128. For different downstream tasks they have different sensitivity for the incorporation of generative discriminators, hence the setting of λ, w, ρ and τ are different and task-sensitive. We initialize the value of learnable scale α as 1 and bias b as 0. 4.2. Functional Correctness For functional correctness, the target is to correctly complete the Verilog module given the problem description and module definition. The performance of Better V is compared with GPT-3.5, GPT-4, Code Llama-7B-Instruct (Roziere et al., 2023), Deep Seek-Coder-6.7b-Instruct (Guo et al., 2024), Code Qwen1.5-7B-Chat (Bai et al., 2023), Chip Ne Mo (Liu et al., 2023a), Thakur et al. (Thakur et al., 2023), Verilog Eval (Liu et al., 2023b) and RTLCoder (Liu et al., 2023c). Table 1. Comparison of functional correctness on Verilog Eval. Model Verilog Eval-machine Verilog Eval-human pass@1 pass@5 pass@10 pass@1 pass@5 pass@10 GPT-3.5 46.7 69.1 74.1 26.7 45.8 51.7 GPT-4 60.0 70.6 73.5 43.5 55.8 58.9 Code Llama 43.1 47.1 47.7 18.2 22.7 24.3 Deep Seek 52.2 55.4 56.8 30.2 33.9 34.9 Code Qwen 46.5 54.9 56.4 22.5 26.1 28.0 Chip Ne Mo 43.4 - - 22.4 - - Thakur et al. 44.0 52.6 59.2 30.3 43.9 49.6 Verilog Eval 46.2 67.3 73.7 28.8 45.9 52.3 RTLCoder-Mistral 62.5 72.2 76.6 36.7 45.5 49.2 RTLCoder-Deep Seek 61.2 76.5 81.8 41.6 50.1 53.4 Better V-Code Llama 64.2 75.4 79.1 40.9 50.0 53.3 Better V-Deep Seek 67.8 79.1 84.0 45.9 53.3 57.6 Better V-Code Qwen 68.1 79.4 84.5 46.1 53.7 58.2 As shown in Table 1, the results demonstrate that our Better V-Code Qwen have achieved the state-of-the-art performance on Verilog Eval, where the pass@1 on Verilog Evalmachine and Verilog Eval-human outperforms GPT-4 by 8.1 and 2.6, respectively. It should be noted that the pre-trained model Code Llama-7B-Instruct has the lowest performance, but after instruct-tuning and the guidance of discriminator a huge performance is improved. 4.3. Customized Generation in Better V Besides the measurement on functional correctness, it is important to consider the performance in EDA downstream tasks, which is our customized generation. Note that in the rest of the paper, we conduct the experiments only based on the model Code Llama-7B-Instruct (Roziere et al., 2023) to demonstrate our results. Therefore, Better V is employed to replace Better V-Codellama , i.e. the model fine-tuned on Code Llama-7B-Instruct (Roziere et al., 2023). And Better V-base refers to the base models of Better VCode Llama , i.e. the model has undergone instruction tuning but have not yet enhanced by discriminator. 4.3.1. SYNTHESIS NODES REDUCTION Firstly, we need to consider the hardware attributes. Therefore, we train the discriminator to improve the performance of Verilog related to later EDA stage, i.e. the synthesis. We make the labels for discriminator depending on whether the logic networks that synthesised from the generated Verilog have less nodes than the reference Verilog. The netlist nodes number are collected by using the proc; aigmap; stat commands from Yosys, which transform the Verilog into an And-Inverter-Graph (AIG). The instruction becomes rewriting the reference to have less nodes after synthesis as shown in Figure 4. We select several problems in Verilog Evalhuman to intuitively demonstrate the performance of Better V, i.e. the reference nodes number and the generated nodes number are represented. The presented nodes number in the table are the average nodes number of the generated Verilog. As shown in Table 2, with the guidance from discriminator, Better V can always generate Verilog that has less netlist nodes. The last two columns refer to the proportion of improvements in node reduction compared to Better V: Controlled Verilog Generation with Discriminative Guidance Better V-Base ( Com Base ) and Reference ( Com Ref ) respectively. It can be found that, on average, the Verilog generated by Better V have 46.52% fewer nodes than the reference model and 31.68% less nodes than Better V-base. The success of Better V on the node number reduction after synthesis is significant, since it marks that LLMs start to participate in optimizing the PPA (Power, Performance, Area) of circuit design in EDA flow. Furthermore, optimizing the Verilog implementation at the beginning of the EDA flow has proved to have meaningful influence on later stages. Table 2. Synthesis nodes reduction with discriminator. Problem Ref Better V-base Better V Com Base Com Ref ece241 2013 q8 657 333.5 255.3 23.44% 61.14% m2041 q6 1370 692.7 685.6 1.03% 49.95% counter 2bc 673 666.2 518.9 22.11% 22.89% review2015 count1k 487 493.4 402.6 18.44% 17.33% timer 498 294.3 247.3 15.97% 50.34% edgedetect2 58 189.9 47.4 75.03% 18.27% counter1to10 325 266.3 240.3 9.76% 26.06% 2013 q2afsm 826 308.8 296.6 3.95% 64.09% dff8p 50 42.3 37.8 10.63% 24.4% fsm3comb 844 167.9 104.4 37.82% 87.63% rule90 6651 12435.6 4536.9 63.52% 31.79% mux256to1v 2376 2439.6 557.2 77.16% 76.54% fsm2 389 186.53 121.9 34.65% 68.66% fsm2s 396 163.7 144.1 11.97% 63.61% ece241 2013 q4 2222 1789.5 897.4 49.85% 59.61% conwaylife 43794 547400.3 27037.4 95.06% 38.26% count clock 3187 2497.5 2222.2 11.02% 30.27% countbcd 1589 932.0 849.3 8.87% 46.55% 4.3.2. VERIFICATION RUNTIME REDUCTION Better V can not only deal with the problem in synthesis, but also participate in the optimization on formal verification. In this section, we discuss the improvement on Boolean Satisfiability (SAT) runtime reduction when doing formal verification on the Verilog by rewriting the Verilog implementation. If we consider the concrete stages during the EDA flow, such as synthesis in last section, as the key for the circuit (PPA) performance, then the verification process decides how safe to produce a circuit. Since the engineers always spend plenty of time to carefully verify the circuit designs, the runtime of formal verification is always a bottleneck to improve its scalability. Therefore, it is essential to consider the impact of Verilog implementation to formal verification runtime. Since using the SAT is one of the mainstream of formal verification, we conduct experiments to reduce the SAT solving time. Table 3. Verification runtime reduction with discriminator. Design Ref Better V-base Better V Com Base Com Ref (s) (s) (s) b03 1.233 1.252 0.857 31.54% 30.49% b06 0.099 0.083 0.078 6.02% 21.21% Spinner 1.577 1.343 1.064 20.77% 32.53% traffic light example 0.583 0.497 0.480 3.42% 17.67% Rotate 1.153 1.126 1.034 8.17% 10.32% Instruction: Below is a Verilog module. And you are supposed to rewrite this Verilog module with the same definition but less And Inverter-Graph nodes after synthesis. module d_latch( input d, input en, input rstn, output reg q ); always @ (en or rstn or d) if (!rstn) q <= 0; else if (en) q <= d; endmodule Figure 4. An example to instruct the LLMs to rewrite the Verilog module to reduce the AIG nodes after synthesis. We make the labels for discriminator depending on whether generated System Verilog has less SAT solving time than the reference. The SAT solving time is collected by using the hierarchy; proc; opt; sat -verify -seq 100 -tempinduct -prove-asserts commands from Yosys, which solve the SAT problem to prove all the asserts in a circuit with 100 time steps. The instruction becomes rewriting the reference to reduce verification runtime for solving SAT problems, which is the same as the example shown in Figure 4 after replacing part of the description to but less Boolean Satisfiability (SAT) solving time . Since the Verilog Eval benchmark doesn t contain System Verilog that includes the assertions inside the design, we choose another benchmark for our experiment in this section. ANSI-C benchmarks give Verilog with safety assertions and can be used for the evaluation of Better V (Mukherjee et al., 2015). Some designs among them are selected to evaluate the performance of Better V. Note that for the Rotate case, we uncomment all the assertions to enhance the difficulty. As shown in Table 3, after the discrimiative guidance, Better V is able to further decrease the SAT solving time. We also show the improvement ratio between the verification time of Verilog generated by Better V and Better V-base ( Com Base ) and reference Verilog ( Com Ref ) in the last two columns respectively. It shows that the Verilog generated by Better V can save 22.45% in verification time compared with the reference Verilog, and 13.99% in time compared with Better V-base. The ability to reduce the verification runtime indicates that LLMs can understand how to rewrite the implementation to accelerate the provement. And it gives the hope that we will be able to solve more complex and complicated problems and hence facilitate the issue of scalability in formal verification. Better V: Controlled Verilog Generation with Discriminative Guidance Table 4. Impact of discriminator on functional correctness. Model pass@1 pass@5 pass@10 Code Llama 18.2 22.7 24.3 Code Llama + Dis 20.3 24.1 24.7 Better V-Code Llama-base 40.0 49.5 53.0 Better V-Code Llama 40.9 50.0 53.3 Table 5. Impact of discriminator on syntactic correctness. Model pass@1 pass@5 pass@10 Code Llama 41.5 50.8 53.8 Code Llama + Dis 49.1 58.4 61.1 Better V-Code Llama-base 82.6 97.6 99.2 Better V-Code Llama 87.1 98.2 99.3 4.4. Ablation Study 4.4.1. IMPACT OF DISCRIMINATOR We also show that the generative discriminator has the ability to further enhance the performance after the guidance. In the task of functional correctness, the labels of discriminator are made by checking whether the generated modules are functional equivalence with the reference module. The equivalence checking is done by the eqy tool in Yosys. With the problems in Verilog Eval-human, the results in Table 4 illustrate that the guidance from the discriminator can not only enhance the capability on our fine-tuned LLMs, i.e. these base models, but also on the original pre-trained LLMs, i.e. Code Llama. This observation indicates that our trained discriminator can be employed to any LLM, only if they have the same vocabulary size. For syntactic correctness, we only consider whether the syntax of the generated Verilog is corrected or not. We construct the labels of generative discriminator based on whether the generated module can be complied by the Yosys with the command prep , which is a generic synthesis script. We also evaluate the performance with the problems in Verilog Eval-human, which is demonstrated in Table 5 and the results show that in the task of generate syntactically correct Verilog, Better V can also achieve remarkable performance, i.e. over 99 pass@10. It can also be observed that the discriminator can help largely enhance the syntactic correctness, i.e. 7.6 and 4.5 improvement on pass@1 for the original Code Llama-7B-instruct model and the Better Vbase, respectively. 5. Discussion 5.1. Current limitations We have shown that Better V can achieve remarkable performance on Verilog generation and the generative discriminator enables improvements on Verilog-related EDA downstream tasks. While Better V contributes valuable insights and a practical method for enhancing the performance of open-source LLMs, it still has limitations on closed-source models due to the need for access to token-level probabilities. The research about guidance generation on closed- source models remains an unsolved issue. In addition, while the generative discriminator uses only a small model, the additional computational overhead it brings cannot be ignored. Research to reduce this overhead is needed to accelerate the guidance in an accuracy-optimal manner. For example, it is possible to use model inference acceleration methods such as quantization and pruning on the discriminator. 5.2. Applicability to other domains The potential applicability of our methods to other domains with similar challenges of data scarcity and precision requirements is promising. For example, many domains utilize structured languages (e.g., SQL for databases, La Te X for typesetting, and XML for web services). The principles behind Better V s instruct-tuning and generative discriminators can be adapted to improve the generation of such languages, particularly where precision and correctness are non-negotiable. Fields such as genomics, where data is highly structured and errors can have significant consequences, could benefit from our approach. The instructtuning methodology could be adapted to understand the syntax and semantics of genetic sequences, while generative discriminators could be used to guide the generation of sequences to complex biological constraints. 6. Conclusion In summary, this paper introduces a novel framework, Better V, for Verilog generation, aimed at controlling Verilog implementation and optimizing its performance across various aspects. We present a complete and easy-to-follow method for collecting and processing Verilog data. Our domain-specific instruct-tuning successfully teaches large language models (LLMs) the knowledge of Verilog using our processed data from the internet. A data augmentation process is proposed to further enhance the diversity of the dataset. Specific generative discriminators are then trained to meet the requirements of different downstream tasks in the electronic design automation (EDA) flow. The experimental results demonstrate the state-of-the-art capabilities of Better V on various tasks. Better V marks a pioneering advancement in optimizing the PPA (Power, Performance, Area) of circuit design and introduces a promising direction to accelerate the verification process. Future work could explore additional domain-specific adaptations and incorporate other powerful techniques in LLMs to further enhance Better V s performance and applicability. Impact Statement This paper aims to contribute to the advancement of the Machine Learning field. While our work may have various societal implications, we do not find it necessary to emphasize any particular consequences here. Better V: Controlled Verilog Generation with Discriminative Guidance Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. ar Xiv preprint ar Xiv:2309.16609, 2023. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. ar Xiv preprint ar Xiv:2107.03374, 2021. Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In International Conference on Learning Representations (ICLR), 2020. Dehaerne, E., Dey, B., Halder, S., and De Gendt, S. A Deep Learning Framework for Verilog Autocompletion Towards Design and Verification Automation. ar Xiv preprint ar Xiv:2304.13840, 2023. Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al. Deepseek-coder: When the large language model meets programming the rise of code intelligence. ar Xiv preprint ar Xiv:2401.14196, 2024. He, Z., Wu, H., Zhang, X., Yao, X., Zheng, S., Zheng, H., and Yu, B. Chat EDA: A large language model powered autonomous agent for EDA. In ACM/IEEE Workshop on Machine Learning CAD (MLCAD), pp. 1 6. IEEE, 2023. Holtzman, A., Buys, J., Forbes, M., Bosselut, A., Golub, D., and Choi, Y. Learning to write with cooperative discriminators. ar Xiv preprint ar Xiv:1805.06087, 2018. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. ar Xiv preprint ar Xiv:2106.09685, 2021. Keskar, N. S., Mc Cann, B., Varshney, L. R., Xiong, C., and Socher, R. Ctrl: A conditional transformer language model for controllable generation. ar Xiv preprint ar Xiv:1909.05858, 2019. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Krause, B., Gotmare, A. D., Mc Cann, B., Keskar, N. S., Joty, S., Socher, R., and Rajani, N. F. Gedi: Generative discriminator guided sequence generation. ar Xiv preprint ar Xiv:2009.06367, 2020. Liu, A., Sap, M., Lu, X., Swayamdipta, S., Bhagavatula, C., Smith, N. A., and Choi, Y. DExperts: Decoding-time controlled text generation with experts and anti-experts. ar Xiv preprint ar Xiv:2105.03023, 2021. Liu, M., Ene, T.-D., Kirby, R., Cheng, C., Pinckney, N., Liang, R., Alben, J., Anand, H., Banerjee, S., Bayraktaroglu, I., et al. Chip Ne Mo: Domain-Adapted LLMs for Chip Design. ar Xiv preprint ar Xiv:2311.00176, 2023a. Liu, M., Pinckney, N., Khailany, B., and Ren, H. Verilog Eval: Evaluating Large Language Models for Verilog Code Generation. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1 8. IEEE, 2023b. Liu, S., Fang, W., Lu, Y., Zhang, Q., Zhang, H., and Xie, Z. RTLCoder: Outperforming GPT-3.5 in Design RTL Generation with Our Open-Source Dataset and Lightweight Solution. ar Xiv preprint ar Xiv:2312.08617, 2023c. Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. ar Xiv preprint ar Xiv:1608.03983, 2016. Lu, Y., Liu, S., Zhang, Q., and Xie, Z. RTLLM: An opensource benchmark for design rtl generation with large language model. ar Xiv preprint ar Xiv:2308.05345, 2023. Mukherjee, R., Kroening, D., and Melham, T. Hardware verification using software analyzers. In IEEE Computer Society Annual Symposium on VLSI, pp. 7 12. IEEE, 2015. ISBN 978-1-4799-8719-1. Mukherjee, R., Tautschnig, M., and Kroening, D. v2c a Verilog to C translator tool. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), volume 9636 of LNCS, pp. 580 586. Springer, 2016. ISBN 978-3-662-49673-2. Nijkamp, E., Hayashi, H., Xiong, C., Savarese, S., and Zhou, Y. Codegen2: Lessons for training llms on programming and natural languages. ar Xiv preprint ar Xiv:2305.02309, 2023. Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models. In ACM/IEEE Supercomputing Conference (SC), pp. 1 16. IEEE, 2020. Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. Code llama: Open foundation models for code. ar Xiv preprint ar Xiv:2308.12950, 2023. Scialom, T., Dray, P.-A., Lamprier, S., Piwowarski, B., and Staiano, J. Discriminative adversarial search for abstractive summarization. In International Conference on Machine Learning (ICML), pp. 8555 8564. PMLR, 2020. Thakur, S., Ahmad, B., Fan, Z., Pearce, H., Tan, B., Karri, R., Dolan-Gavitt, B., and Garg, S. Benchmarking Large Language Models for Automated Verilog RTL Code Generation. In IEEE/ACM Proceedings Design, Automation and Test in Eurpoe (DATE), pp. 1 6. IEEE, 2023. Wolf, C. Yosys open synthesis suite. https://yosyshq. net/yosys/. Yao, Z., Aminabadi, R. Y., Ruwase, O., Rajbhandari, S., Wu, X., Awan, A. A., Rasley, J., Zhang, M., Li, C., Holmes, C., Zhou, Z., Wyatt, M., Smith, M., Kurilenko, L., Qin, H., Tanaka, M., Che, S., Song, S. L., and He, Y. Deep Speed-Chat: Easy, Fast and Affordable RLHF Training of Chat GPT-like Models at All Scales. ar Xiv preprint ar Xiv:2308.01320, 2023. Zhang, P., Zeng, G., Wang, T., and Lu, W. Tiny Llama: An Open-Source Small Language Model, 2024.