# DocPrompting: Generating Code by Retrieving the Docs

Published as a conference paper at ICLR 2023

Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig
Language Technologies Institute, Carnegie Mellon University; Inspired Cognition
{shuyanzh,ualon,fangzhex,zhiruow,zhengbaj,gneubig}@cs.cmu.edu

ABSTRACT

Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus, existing models inherently cannot generalize to using unseen functions and libraries, because these would never appear in their training data. In contrast, when human programmers use functions and libraries for the first time, they frequently refer to textual resources such as code manuals and documentation, to explore and understand the available functionality. Inspired by this observation, we introduce DocPrompting: a natural-language-to-code generation approach that explicitly leverages code documentation by (1) retrieving the relevant documentation pieces given a natural language (NL) intent, and (2) generating code based on the NL intent and the retrieved documentation. DocPrompting is general: it can be applied to any programming language, and is agnostic to the underlying neural model. We demonstrate that DocPrompting consistently improves NL-to-code models: DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset, tldr, DocPrompting improves CodeT5 and GPT-Neo-1.3B by up to an absolute 6.9% exact match.¹

¹ Data and code are available at https://github.com/shuyanzhou/docprompting.

1 INTRODUCTION

We address the task of natural language to code generation (NL-to-code): generating a code snippet, written in a general-purpose programming language such as Python or Bash, given a natural language intent. This task has seen sharply growing popularity recently due to the emergence of large language models trained on vast amounts of natural language and code (Chen et al., 2021; Xu et al., 2022; Fried et al., 2022). NL-to-code models facilitate programming for both professional and inexperienced programmers, by allowing them to write code by only expressing their higher-level intent.

Many existing code generation models either learn directly from input-output pairs provided as training data (Allamanis et al., 2015; Yin and Neubig, 2017; Iyer et al., 2018; Brockschmidt et al., 2019; Xu et al., 2020; Alon et al., 2020; Wang et al., 2021), or learn the mapping between input and output implicitly from naturally occurring corpora of intertwined natural language and code (Austin et al., 2021; Nijkamp et al., 2022). Nevertheless, all these works assume that all libraries and function calls were seen in the training data, and that at test time the trained model will need to generate only seen libraries and function calls. However, new functions and libraries are introduced all the time, and even a seen function call can have unseen arguments. Thus, these existing models inherently cannot generalize to generate such unseen usages. In contrast to these existing models, human programmers frequently refer to manuals and documentation when writing code (Nykaza et al., 2002; Lethbridge et al., 2003). This allows humans to easily use functions and libraries they have never seen nor used before.
Inspired by this ability, we propose DocPrompting: a code generation approach that learns to retrieve code documentation before generating the code. An overview of our approach is illustrated in Figure 1: first, a document retriever uses the NL intent n to retrieve relevant code documentation {d1, d2, d3} from a documentation pool D. Then, a code generator uses these docs in its prompt to generate the corresponding code c. The documentation pool serves as an external data store that can be updated frequently with new contents (e.g., documentation of newly released libraries), without re-training any model component. This way, DocPrompting can leverage newly added documentation, and it can generate code containing unseen and unused functions and libraries. DocPrompting is general and applicable to any programming language and underlying base architecture. To the best of our knowledge, this is the first demonstration of leveraging documentation in models of code explicitly and effectively.

[Figure 1: DocPrompting: given an NL intent n ("Generate HTML with python syntax highlighting for print('reading docs')"), the retriever retrieves a set of relevant documentation {d1, d2, d3} from a documentation pool D (d1: "pygments is a generic syntax highlighter"; d2: "class PythonLexer: for Python source code; a lexer splits the source into tokens"; d3: "class HtmlFormatter: a formatter takes the token stream and writes it to an output file, formatting tokens as HTML 4 tags"). Then, the generator generates the code c based on the NL and the retrieved docs, e.g., s = highlight(code, PythonLexer(), HtmlFormatter()). DocPrompting allows the model to generalize to previously unseen usages by reading those docs. Italic blue highlights the tokens shared between the NL and the docs; bold shows the tokens shared between the docs and the code snippet.]

We demonstrate the effectiveness of DocPrompting on two NL-to-code benchmarks and tasks, across two programming languages, and using several base models: GPT-Neo (Black et al., 2021), T5 (Raffel et al., 2020), CodeT5 (Wang et al., 2021), Fusion-in-Decoder (Izacard and Grave, 2021), and Codex (Chen et al., 2021). Further, we experiment with both sparse retrievers such as BM25 (Robertson and Jones, 1976) and dense retrieval models such as SimCSE (Gao et al., 2021). Finally, we introduce two new benchmarks for retrieval-based code generation: (a) in Bash, we curate a new benchmark by crawling the tldr repository and constructing the training/development/test splits without overlapping commands; (b) in Python, we re-split the popular CoNaLa benchmark (Yin et al., 2018) by making every test example contain at least one Python function that is not seen in the training data.

Models that use DocPrompting consistently outperform their base models that generate code solely based on the NL intents. Using DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on CoNaLa; on the new tldr dataset, DocPrompting improves CodeT5 and GPT-Neo-1.3B by up to an absolute 6.9% exact match. We release our new benchmarks, including annotations of oracle documents for each example and pools of documentation, to serve as a test-bed for future retrieval-based code generation models.
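The generated code in Figure 1 uses the real pygments API; as a concrete illustration of the target behavior, here is a minimal runnable version of that example (assuming pygments is installed; this exact snippet is ours, not taken verbatim from the benchmark):

```python
# A runnable version of the Figure 1 example (requires `pip install pygments`).
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

code = 'print("reading docs")'
# PythonLexer splits the Python source into tokens;
# HtmlFormatter writes the token stream out as HTML tags.
html = highlight(code, PythonLexer(), HtmlFormatter())
print(html)
```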
2 CODE GENERATION BY READING THE DOCS

Our underlying assumption is that code documentation is the most exhaustive yet succinct resource for most libraries and programming languages (Roehm et al., 2012), and that documentation allows models to generalize effectively to unseen libraries and functions (Forward and Lethbridge, 2002). We follow the retrieve-then-generate paradigm (Lewis et al., 2020; Guu et al., 2020), focusing on retrieving documentation. In this section, we describe the general approach of DocPrompting; in Sections 3 and 6.2, we elaborate on and experiment with practical implementations of DocPrompting.

Formulation: Given an NL intent n, our goal is to generate a corresponding code snippet c written in some programming language (PL) such as Python. We assume that a model has access to a collection of code documentation D. Each document $d_i \in D$ describes the usage of a library, a function, or an argument in that PL. The construction of D is flexible: it can either be a comprehensive set of all available libraries and functions in a PL, or a customized subset for the scope of a specific project.

2.1 BACKGROUND: RETRIEVAL-CONDITIONED GENERATION

Although a model may use the entire collection of documents D, only a few documents in D are relevant for any particular intent. Further, it is usually computationally infeasible to directly condition on the entire, unbounded, collection of documents while making predictions. Thus, we first let the model select a subset of documents $D_n = \{d_1, d_2, \dots, d_k\} \subseteq D$ that are potentially relevant given n, and refer to this subset while generating c. Overall, we decompose the probability of generating c into the probability of choosing a particular subset of documents $P(D_n \mid D, n)$ and the probability of generating the code conditioned on the intent and the selected documents $P(c \mid D_n, n)$; finally, we marginalize over all $D_n \subseteq D$:

$$P(c \mid D, n) = \sum_{D_n \subseteq D} P(c \mid D_n, n)\, P(D_n \mid D, n) \tag{1}$$

assuming that c is independent of D given $D_n$ (that is, $c \perp D \mid D_n$). Since enumerating all possible subsets $D_n$ is computationally infeasible, we follow the common practice and approximate the marginalization over $D_n$ in Equation (1) by taking the most probable subset of retrieved documents $\hat{D}_n$, and then conditioning the prediction of c on these most likely documents:

$$\hat{D}_n = \arg\max_{D_n \subseteq D} P(D_n \mid D, n), \qquad P(c \mid D, n) \approx P(c \mid \hat{D}_n, n)\, P(\hat{D}_n \mid D, n) \tag{2}$$

2.2 DOCPROMPTING: GENERATING CODE BY RETRIEVING THE DOCS

Equation (2) implies that DocPrompting relies on two main components: a retriever R retrieves relevant documents $\hat{D}_n$ given the intent n, and a generator G generates the code snippet c conditioned on the retrieved documents $\hat{D}_n$ and the intent n, which compose a new prompt. Specifically, R computes a similarity score $s(d_i, n)$ between an intent n and every document $d_i \in D$. Thus, the subset $\hat{D}_n \subseteq D$ is the set of top-k documents with the highest similarity scores: $\hat{D}_n = \text{top-}k_{d_i \in D}\, s(d_i, n)$.

An overview of our approach is illustrated in Figure 1: given the intent "Generate HTML with python syntax highlighting for print('reading docs')", the retriever R retrieves three relevant documents: d1 describes the syntax highlighting library pygments, d2 describes the class PythonLexer, and d3 describes the HtmlFormatter class. Given these docs and the intent, the generator G generates the code snippet c, which uses PythonLexer and HtmlFormatter from the pygments library.
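The approximation in Equation (2) boils down to scoring documents, keeping the top-k, and conditioning the generator on them. The sketch below makes that control flow explicit; the scoring function, generator, and prompt layout are placeholders, not the exact implementations used in the paper.

```python
from typing import Callable, List

def docprompting_generate(
    intent: str,
    doc_pool: List[str],
    score: Callable[[str, str], float],   # s(d_i, n): any sparse or dense similarity
    generate: Callable[[str], str],       # any NL->code model that takes a text prompt
    k: int = 10,
) -> str:
    """Approximate Eq. (2): retrieve the top-k docs for the intent, then generate."""
    # \hat{D}_n: the k documents with the highest similarity score s(d_i, n).
    top_docs = sorted(doc_pool, key=lambda d: score(d, intent), reverse=True)[:k]
    # Condition the generator on the retrieved docs and the intent.
    # (This is the joint-encoding variant; FiD-style models instead encode
    #  each (intent, doc) pair separately.)
    prompt = "\n".join(top_docs) + "\n" + intent
    return generate(prompt)
```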
3 PRACTICAL INSTANTIATIONS OF DOCPROMPTING

DocPrompting is a general approach that is not bound to any specific model choices, and it can be instantiated with any base retriever and generator. This section presents the concrete instantiations of R and G that we found to provide the best performance in our experiments.

3.1 RETRIEVER INSTANTIATION

We experiment with two main types of retrievers: sparse retrievers and dense retrievers. As our sparse retriever, we use Elasticsearch² with the standard BM25 (Robertson and Jones, 1976). This retriever represents documents using sparse features that rely on word frequencies, such as BM25 and TF-IDF.

As our dense retriever, we follow prior work (Chen et al., 2020; Karpukhin et al., 2020; Gao et al., 2021): given a triplet $(n, c, D^*_n)$, where $D^*_n$ are the oracle docs for n, each $d^+_i \in D^*_n$ and n form a positive pair $(n, d^+_i)$, while each $d^-_j \notin D^*_n$ and n form a negative pair $(n_i, d^-_j)$. We train the retriever in a contrastive fashion, where the similarity score of a positive pair is maximized while that of in-batch negative pairs is minimized. For a pair $(n_i, d^+_i)$, the loss function is defined as:

$$\mathcal{L}_r = -\log \frac{\exp\!\big(\mathrm{sim}(h_n, h_{d^+_i})\big)}{\exp\!\big(\mathrm{sim}(h_n, h_{d^+_i})\big) + \sum_{d^-_j \in \mathcal{B} \setminus D^*_n} \exp\!\big(\mathrm{sim}(h_n, h_{d^-_j})\big)} \tag{3}$$

where $h_x$ is the representation of x computed by a neural encoder, and $\mathcal{B}$ contains the positive docs of the other examples in the batch. We define $\mathrm{sim}(h_x, h_y)$ as the cosine similarity between $h_x$ and $h_y$.

² https://github.com/elastic/elasticsearch

We use all $(n_i, d^+_i)$ pairs in the training set as our supervised training dataset. Additionally, we use all sentences in the documentation pool for weak supervision: following Chen et al. (2020) and Gao et al. (2021), representations of the same sentence with different dropout masks are treated as a positive example. Instead of using either supervised or weakly supervised training as in Gao et al. (2021), we simply mix the two resulting supervision signals, and examples are randomly distributed into batches. This mixture of tasks not only facilitates the learning process (Section 6.2), but also reduces the engineering effort required to store and reload models for separate supervised and unsupervised training phases. We initialize the retriever encoder with either the best model of Gao et al. (2021) or the encoder of CodeT5-base (Wang et al., 2021). Additional training details are provided in Appendix C.

3.2 GENERATOR INSTANTIATION

We experimented with a variety of generator models. We used GPT-Neo-125M, GPT-Neo-1.3B (Black et al., 2021) and Codex (Chen et al., 2021), where we concatenate the retrieved documents and the NL intent as a single, long prompt. T5-base (Raffel et al., 2019) and CodeT5-base (Wang et al., 2021) have a shorter input size of 512 tokens, which is sometimes too short for the concatenation of multiple docs. Thus, for T5 and CodeT5 we apply the fusion-in-decoder approach (FiD; Izacard and Grave, 2021): we first concatenate the intent n with each retrieved $d_i \in \hat{D}_n$ and encode each $(n, d_i)$ pair independently. Then, the decoder attends to all encoded NL-document pairs. We finetune the generator to maximize the log-likelihood of the reference code c given n and $\hat{D}_n$.

With Codex (Chen et al., 2021), we performed few-shot learning rather than finetuning because the model parameters are not publicly available. We constructed the prompt with three static examples, each of which is a concatenation of retrieved documentation, an NL intent, and the reference code snippet. We then appended the test example and its retrieved documentation to the few-shot examples. We used the code-davinci-001 version because we suspect potential leakage of the test set into the training set of code-davinci-002; see more details in Appendix H. Training details, hyper-parameter settings, and example prompts can be found in Appendices E and D.
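For concreteness, the in-batch contrastive objective of Equation (3) from Section 3.1 can be sketched in PyTorch as below. This is a simplified illustration: the encoder and pooling are left abstract, the temperature term that SimCSE-style training usually adds is omitted (Equation (3) is written without one), and batch negatives that happen to be oracle docs of the same example are not filtered out.

```python
import torch
import torch.nn.functional as F

def retriever_loss(h_intents: torch.Tensor, h_pos_docs: torch.Tensor) -> torch.Tensor:
    """In-batch contrastive loss in the spirit of Eq. (3).

    h_intents:  [B, H] encoder representations of the NL intents n_i
    h_pos_docs: [B, H] representations of one oracle doc d_i^+ per intent
    The positive docs of the other examples in the batch act as negatives for n_i.
    """
    # sim(h_x, h_y): cosine similarity between every intent and every doc -> [B, B]
    sims = F.cosine_similarity(h_intents.unsqueeze(1), h_pos_docs.unsqueeze(0), dim=-1)
    # For row i, column i is the positive pair and every other column is an in-batch
    # negative, so Eq. (3) reduces to a softmax cross-entropy over each row.
    targets = torch.arange(h_intents.size(0), device=h_intents.device)
    return F.cross_entropy(sims, targets)
```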
4 EXPERIMENTAL SETUP

We evaluate DocPrompting on two NL-to-code tasks: shell scripting (Section 4.1), in which we generate complex shell commands given an intent, and Python programming (Section 4.2), where we generate answers in Python for NL questions. In this section, we first introduce a newly curated benchmark, tldr; we then describe our re-split of the popular CoNaLa benchmark (Yin et al., 2018). For each benchmark, we provide a global documentation pool D that is shared across all examples and oracle documents $D^*_n$ which we use to train the retriever. We release our newly curated benchmarks to serve as a test-bed for future retrieval-based code generation models.

4.1 SHELL SCRIPTING

tldr is a community-driven project that maintains easily readable help pages with examples for over 2.5k Bash commands, in over 25 natural languages.³ We collected pairs of English intents and Bash command lines. The NL intents are written by human users, and the Bash commands range from popular ones like cat and tar to uncommon commands such as toilet and faketime. Our resulting tldr benchmark contains 1,879 unique Bash commands and 9,187 NL-Bash pairs. We constructed the training, development, and test sets with completely disjoint commands to test the generalizability of a code generation model.

[Figure 2: An example NL-code pair from tldr, along with three oracle documentation items.]

³ https://github.com/tldr-pages/tldr

The shared documentation pool D is made up of the 400k paragraphs from the 1,879 Bash manuals. Each paragraph describes a single concept such as an argument flag. We further curated the oracle documents $D^*_n$ for each example using simple string matching. An example from tldr is shown in Figure 2. To the best of our knowledge, this is the first work to leverage tldr as an NL-to-code benchmark. Detailed statistics and additional details are provided in Appendix A.

In tldr, each NL intent results in a single Bash command with a combination of argument flags. We therefore first retrieve an entire Bash manual; then, we take the top manual and retrieve the top-10 paragraphs from that manual.

Evaluation metrics: We measure (a) command name accuracy (CMD Acc): whether the command name (e.g., cat) is an exact match; (b) exact match (EM): exact match between the reference and the generation; (c) token-level F1; and (d) character-level BLEU (charBLEU; Lin et al., 2018; Shi et al., 2022). In all metrics, we disregard user-specific variable names in the references and the models' outputs. For example, mycli -u [user] -h [host] [database] is evaluated as mycli -u $1 -h $2 $3.
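The variable anonymization above can be implemented with a small regex pass; the following sketch is illustrative only, since the exact placeholder conventions and normalization script used for the released benchmark may differ.

```python
import re

def anonymize_placeholders(cmd: str) -> str:
    """Map user-specific placeholders such as [user] or [host] to positional $1, $2, ...

    e.g. "mycli -u [user] -h [host] [database]" -> "mycli -u $1 -h $2 $3"
    """
    counter = 0

    def _number(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"${counter}"

    # tldr contributors write placeholders in brackets (raw pages use {{...}}).
    return re.sub(r"\[[^\]]+\]|\{\{[^}]+\}\}", _number, cmd)

assert anonymize_placeholders("mycli -u [user] -h [host] [database]") == "mycli -u $1 -h $2 $3"
```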
4.2 PYTHON PROGRAMMING

CoNaLa (Yin et al., 2018) is a popular benchmark for NL-to-Python generation. NL intents are StackOverflow questions, and code snippets are their answers. Both intents and code snippets are rewritten by human annotators. We re-split the dataset to test models' generalization to unseen Python functions. In our re-split, we verified that every example in the development or the test set uses at least one Python function (e.g., plt.plot) that was not seen in the training data. In addition, we make sure that examples from the same StackOverflow post are in the same set, to prevent leakage. This re-split results in 2,135/201/543 examples in the training/development/test sets, respectively.

The CoNaLa documentation pool D contains 35,763 documents, each describing a single function, from all Python libraries available on DevDocs (https://devdocs.io). These include built-in libraries and other popular libraries such as numpy. We constructed the oracle docs $D^*_n$ for each example by matching all function names in the target code c with docs. More details are provided in Appendix B.

Evaluation metrics: We follow Yin et al. (2018) and measure BLEU-4. Since we focus on generalization to unseen functions, we additionally report function name recall (recall) and unseen function recall (recall_unseen), which measures recall among function calls that do not appear in the training set. Finally, following Chen et al. (2021) and Austin et al. (2021), we used the manually written unit tests from Wang et al. (2022) for 100 examples from CoNaLa's test set and measure pass@k. We followed Chen et al. (2021) and performed nucleus sampling (Holtzman et al., 2019) with p = 0.95. For each k, we searched for the best temperature for each model from {0.2, 0.4, 0.6, 0.8, 1.0}. On average, each example has 2.03 tests. Since the concatenation of multiple Python docs often exceeded the length limit of GPT-Neo, we experimented on this dataset with FiD, which allows longer inputs. Additional details are provided in Appendix B.
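For reference, the execution-based pass@k numbers reported below use the unbiased estimator proposed by Chen et al. (2021): draw n samples per problem, count the c samples that pass all unit tests, and average 1 - C(n-c, k)/C(n, k) over problems. A direct transcription of that estimator:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples drawn for a problem, c: samples that pass all unit tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k) as a numerically stable running product.
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```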
In all following results, all models with DocPrompting use the top-10 retrieved docs from the best retriever on that dataset (Table 4). Every baseline uses the exact same setup as its +DocPrompting version, except for not using the documentation.

5.1 SHELL SCRIPTING RESULTS

Results for tldr are shown in Table 1. DocPrompting consistently improves the base models. For example, T5+DocPrompting achieves more than twice the accuracy of the vanilla T5 in predicting the command name, a gain of more than 16 charBLEU points on the entire prediction, and an absolute exact-match gain of almost 9%. In the few-shot learning setting with Codex, DocPrompting brings gains of 6.7 charBLEU points and consistent improvement across all metrics over the baseline that observes only NL-code pairs in its prompt. These results show that retrieving documentation also benefits strong models such as Codex, even with only a few examples in the context.

Table 1: Results on shell scripting, using a BM25 retriever with top-10 retrieved docs, on the test set of tldr. For the oracle command name experiments, we selected the best model of each type.

| Model | | CMD Acc (%) | EM (%) | Token F1 | charBLEU |
|---|---|---|---|---|---|
| GPT-Neo-125M | base | 11.96 | 1.94 | 28.75 | 19.99 |
| | +DocPrompting | 25.32 | 3.56 | 31.23 | 24.43 |
| GPT-Neo-1.3B | base | 14.55 | 3.12 | 32.46 | 24.70 |
| | +DocPrompting | 27.59 | 9.05 | 37.24 | 30.57 |
| T5 | base | 10.02 | 0.76 | 19.90 | 25.48 |
| | +DocPrompting | 30.28 | 9.16 | 37.58 | 31.97 |
| CodeT5 | base | 14.60 | 2.18 | 30.00 | 21.50 |
| | +DocPrompting | 30.72 | 9.15 | 36.71 | 33.83 |
| Codex 3-shot | base | 27.48 | 8.94 | 36.04 | 16.94 |
| | +DocPrompting | 31.21 | 9.29 | 36.77 | 23.72 |
| With the oracle command name | | | | | |
| T5 | base | - | 12.96 | 59.36 | 45.05 |
| | +DocPrompting | - | 22.55 | 64.84 | 54.28 |
| Codex 3-shot | base | - | 22.44 | 62.26 | 50.29 |
| | +DocPrompting | - | 32.43 | 69.73 | 55.21 |

Code generation with oracle command names: In realistic settings, a human programmer may know the command name they need to use (e.g., awk) but not the exact usage and flags. In fact, a better understanding of the usage of known commands is the purpose of Unix man pages and the tldr project. We conducted an oracle experiment where we provided T5 (which was the strongest model using DocPrompting) and Codex with the oracle command name (e.g., awk). This oracle information is provided to both the baseline and the model that uses DocPrompting. The results are shown in the bottom part of Table 1. When the oracle command is given, DocPrompting further improves over the base models. For example, when providing Codex with the ground-truth command name, DocPrompting improves its exact match from 22.44% to 32.43%.

Should we retrieve documentation or examples? All existing retrieval-based models of code retrieve NL-code pairs or code snippets, rather than documentation. To simulate this scenario, we followed Parvez et al. (2021) and Pasupat et al. (2021) and retrieved NL-code pairs from the training set of tldr; we refer to this baseline as ExPrompting. We finetuned the best retriever (RoBERTa) and two generators, and retrieved the top-30 NL-code pairs for every example. As shown in Table 2, retrieving documentation (DocPrompting) provides much higher gains than retrieving examples (ExPrompting). Theoretically, adding examples of unseen commands can help ExPrompting generalize to them as well. However, new libraries and functions may not have available examples on the web yet, while documentation typically becomes available when the library is released.

Table 2: Comparison to approaches that retrieve examples (Parvez et al., 2021; Pasupat et al., 2021).

| Model | | CMD Acc (%) | EM (%) | Token F1 | charBLEU |
|---|---|---|---|---|---|
| GPT-Neo-125M | +ExPrompting | 6.68 | 0.32 | 20.49 | 11.15 |
| | +DocPrompting | 25.32 | 3.56 | 31.23 | 24.43 |
| GPT-Neo-1.3B | +ExPrompting | 14.01 | 2.8 | 30.07 | 22.11 |
| | +DocPrompting | 27.59 | 9.05 | 37.24 | 30.57 |

5.2 PYTHON PROGRAMMING RESULTS

Table 3 shows the results on CoNaLa. CodeT5+DocPrompting yields a 1.65 BLEU improvement over the state-of-the-art baseline that was initialized with CodeT5.⁴ When measuring the recall of the generated function names, the benefit of DocPrompting is especially high for unseen functions (recall_unseen). For example, DocPrompting achieves 18.30 compared to only 9.03 of the base CodeT5 on unseen functions. Additionally, DocPrompting improves the in-context learning setting with Codex.

⁴ In a separate experiment on the original split of CoNaLa, this baseline achieved a BLEU score of 39.12, which outperforms the previous state of the art (Beau and Crabbé, 2022) by 4.92 BLEU points.
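The function recall metrics in Table 3 require extracting the set of called function names from the reference and the generated code. The paper does not spell out its exact extraction procedure, so the ast-based sketch below is only illustrative of how such a metric can be computed:

```python
import ast
from typing import Optional, Set

def called_names(code: str) -> Set[str]:
    """Best-effort set of function/method names called in a Python snippet."""
    names: Set[str] = set()
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return names
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)   # e.g. plt.plot(...) -> "plot"
            elif isinstance(node.func, ast.Name):
                names.add(node.func.id)     # e.g. open(...) -> "open"
    return names

def function_recall(reference: str, prediction: str,
                    unseen: Optional[Set[str]] = None) -> float:
    """Recall of reference function names in the prediction.

    Passing the set of functions absent from the training data as `unseen`
    restricts the metric to recall_unseen.
    """
    ref = called_names(reference)
    if unseen is not None:
        ref &= unseen
    if not ref:
        return 1.0
    return len(ref & called_names(prediction)) / len(ref)
```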
Table 3: Results on CoNaLa, using a CodeT5 retriever with top-10 retrieved docs. Function recall (Recall) measures how many functions in the reference code are correctly predicted, and unseen function recall (Recall_unseen) only considers the subset held out from the training data.

| Model | | BLEU | Recall | Recall_unseen |
|---|---|---|---|---|
| Codex 3-shot | base | 43.16 | 39.52 | - |
| | +DocPrompting | 43.47 | 39.87 | - |
| | +DocPrompting, oracle docs | 50.59 | 57.84 | - |
| T5 | base | 28.07 | 14.36 | 2.57 |
| | +DocPrompting | 30.04 | 21.34 | 8.24 |
| CodeT5 | base | 34.57 | 24.24 | 9.03 |
| | +DocPrompting | 36.22 | 27.80 | 18.30 |
| | +DocPrompting, oracle docs | 49.04 | 72.20 | 63.91 |

[Figure 3: Pass@k of CodeT5 with and without DocPrompting on 100 CoNaLa examples.]

[Figure 4: Using documentation significantly increases the n-gram overlap recall between the input and the output, in tldr and CoNaLa.]

We hypothesize that the minor gain is mainly due to potential data leakage in Codex, which violates the split of seen and unseen functions. Another reason is that a strong generator such as Codex may require an equally strong retriever. We find that Codex can achieve even higher results with an oracle retriever, which shows the potential for further improvement by improving the retrievers. Finally, CodeT5 performs better than T5, with and without using DocPrompting. This emphasizes the importance of using code-specific pretrained models.

Execution-based evaluation: The results are shown in Figure 3. Using DocPrompting consistently outperforms the baseline CodeT5 for all values of pass@k. For example, DocPrompting yields a 2.85% improvement on pass@1 and a 4.45% improvement on pass@5, which are realistic numbers of completions that can be suggested in an IDE. When k = 200, DocPrompting widens the gap to 8.38%. These results demonstrate that DocPrompting not only improves the quality of the generated code in its surface form, but also increases its functional correctness. Additional details and results are provided in Appendix G.

6.1 WHY DOES READING THE DOCUMENTATION HELP GENERATING MORE ACCURATE CODE?

We believe that one of the major reasons is that documentation eases the mapping between NL intents and code, since the documentation contains both NL descriptions and function signatures. We calculated the n-gram overlap between the NL intents and their corresponding code snippets (NL -> code), and the overlap between the NL intents combined with their top-10 retrieved documents and their code snippets ((NL+docs) -> code). As shown in Figure 4, adding documentation significantly increases the overlap across n-grams, and increases, for example, the unigram overlap from 12% to 24% in tldr. That is, one of the reasons that retrieving documentation helps generate accurate code is that documentation bridges the gap between the intent terminology and the code terminology.
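The overlap statistic behind Figure 4 (reported in full in Table 8) is the fraction of the code's n-grams that also appear in the input. A simple whitespace-tokenized sketch is shown below; the paper's exact tokenization is not specified, so treat this as an approximation.

```python
from typing import Set, Tuple

def ngrams(tokens, n: int) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_recall(source: str, code: str, n: int = 1) -> float:
    """Fraction of the code's n-grams that also occur in the source text.

    `source` is either the NL intent alone (the "NL -> code" rows of Table 8)
    or the intent concatenated with the retrieved docs ("(NL+docs) -> code").
    """
    code_ngrams = ngrams(code.split(), n)
    if not code_ngrams:
        return 0.0
    return len(code_ngrams & ngrams(source.split(), n)) / len(code_ngrams)
```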
6.2 ABLATION STUDY

We compared different configurations of the retriever to gather more insights for effective DocPrompting. Table 4 shows a comparison between different retrievers and their setups.

Table 4: Retrieval performance (recall@n) of multiple models on the dev set of tldr (top) and CoNaLa (bottom). RoBERTa is the best model taken from Gao et al. (2021), and CodeT5 is the encoder of CodeT5-base (Wang et al., 2021). Models marked "off-the-shelf" are the off-the-shelf models; the other models were finetuned with the objective in Equation (3). The last column is the best model (RoBERTa for tldr and CodeT5 for CoNaLa) trained without the weak supervision corpus.

| Dataset | n | BM25 | RoBERTa (off-the-shelf) | RoBERTa | CodeT5 (off-the-shelf) | CodeT5 | Best w/o weak sup. |
|---|---|---|---|---|---|---|---|
| tldr | 1 | 32.81 | 17.53 | 30.03 | 10.45 | 18.10 | 28.30 |
| | 5 | 51.73 | 37.89 | 52.50 | 20.26 | 38.52 | 50.50 |
| | 10 | 59.86 | 46.80 | 60.33 | 25.73 | 51.03 | 59.84 |
| | 20 | 62.01 | 56.11 | 64.30 | 33.65 | 57.26 | 62.30 |
| CoNaLa | 1 | 3.01 | 4.46 | 13.49 | 4.60 | 16.54 | 10.51 |
| | 5 | 7.16 | 7.58 | 26.38 | 8.63 | 42.35 | 21.15 |
| | 10 | 9.73 | 10.93 | 34.86 | 12.25 | 55.81 | 29.34 |
| | 20 | 11.46 | 13.89 | 45.46 | 18.46 | 66.79 | 42.21 |

First, the performance of BM25 varies among datasets: on tldr, BM25 matches the recall of trained dense retrievers; on CoNaLa, however, BM25 achieves a recall@10 of only 9.73%, while strong dense retrievers such as the encoder of CodeT5 achieve a recall@10 of 55.81%. We hypothesize that this difference between datasets stems from the ways these datasets were created: tldr intents were written based on existing Bash commands and manuals, while CoNaLa examples were mined from StackOverflow posts, where users ask questions with limited or no context. Thus, NL intents in CoNaLa require a better semantic alignment with the documents, and thus benefit from dense retrievers. The gap resulting from different data curation processes was also observed by Rodriguez and Boyd-Graber (2021) in open-domain question answering (QA).

Second, retrievers that were pretrained on the target programming language are generally stronger. For example, on CoNaLa, CodeT5, which was pretrained on Python, is both a better off-the-shelf retriever and a better finetuned retriever than RoBERTa, which was pretrained mainly on text. In contrast, tldr is based on Bash, which neither CodeT5 nor RoBERTa were explicitly pretrained on. Thus, tldr benefits more from BM25 and RoBERTa than from CodeT5 as retrievers.

Finally, training the retriever using weak supervision on the documentation pool (Section 3.1) dramatically improves the retriever. The recall of the best retriever on each dataset without this corpus is shown in the last column of Table 4 ("Best w/o weak sup."). On CoNaLa, removing this corpus results in severe performance degradation. One possible explanation is that this weak supervision helps the retriever perform domain adaptation more effectively.

6.3 CASE STUDY

We examine the models' outputs and show two representative examples in Table 5. In the first example, Image.open was not seen in the training set, and the baseline CodeT5 incorrectly predicts os.open. In contrast, using DocPrompting allows the model to retrieve the docs and correctly predict Image.open. In the second example, df.to_csv was not seen in training, and the baseline CodeT5 fails to predict it. In contrast, DocPrompting does predict most of the df.to_csv call correctly, thanks to the retrieved docs. Nevertheless, DocPrompting generates an incorrect argument, skiprows=1, instead of header=False. The reason is that, along with the retrieved documentation of df.to_csv, the retriever also retrieved the documentation of df.read_csv, which has a skiprows argument. That is, the generator uses an argument of df.read_csv with the function df.to_csv. Further improving the retrievers and the generators, and post-filtering based on the validity of argument names, may mitigate such mistakes.

Table 5: Examples of predictions from CoNaLa, of the base CodeT5 compared to CodeT5+DocPrompting. Unseen functions are underlined.
NL intent: Open image picture.jpg
Ground truth: img = Image.open('picture.jpg') \n img.show()
CodeT5: os.open('picture.jpg', 'r')
CodeT5+DocPrompting: image = Image.open('picture.jpg', 'rb')

NL intent: Exclude column names when writing dataframe df to a csv file filename.csv
Ground truth: df.to_csv('filename.csv', header=False)
CodeT5: df.drop(['col1', 'col2'], axis=1, inplace=True)
CodeT5+DocPrompting: df.to_csv('filename.csv', skiprows=1)

7 RELATED WORK

Code generation: The most common practice in NL-to-code generation is training a model on a dataset of NL-code pairs (Allamanis et al., 2015; Yin and Neubig, 2017; Rabinovich et al., 2017; Iyer et al., 2018). Nevertheless, all these works assume that their training corpus covers all required libraries and functions, and their models are inherently incapable of generating libraries and functions that were not seen in the training data. On the contrary, DocPrompting allows models to generate calls to unseen functions, by retrieving these functions' documentation and reading it at test time. Hayati et al. (2018), Parvez et al. (2021), Hashimoto et al. (2018), and Lu et al. (2017) learn to retrieve examples at test time; Pasupat et al. (2021) also considered settings where the test data has a distribution shift from the training data. However, when new libraries are released they often come with documentation, and thus we assume that documentation for new libraries is much more likely to be available than concrete natural-language-intent and code-snippet pairs (n, c) that already use these libraries. The models of Shrivastava et al. and Wu et al. (2021) retrieve code snippets from relevant files in the same project; in contrast, when predicting new libraries and functions that are external to the user's project, documentation is the source that is most likely to be available.

Retrieval-augmented generation: The retrieve-then-generate paradigm has gained popularity in the field of open-domain question answering (Guu et al., 2020; Lewis et al., 2020; Karpukhin et al., 2020), where the answer to an open-domain question exists in only a few documents out of a much larger pool. Although DocPrompting takes a similar approach, documentation retrieval in code generation is even more valuable, since code libraries are updated constantly and new libraries are introduced daily. Thus, DocPrompting allows updating the documentation pool frequently with new contents, without re-training any model components.

Documentation-conditioned generation: The model of Zhong et al. (2019) reads documents to understand environment dynamics in a grid-world game, and Branavan et al. (2011) control situated agents in a game (Civilization II) by reading the game's manual. However, all their models were tailored to specific games; in contrast, DocPrompting is general and applicable to a variety of programming languages and datasets.

8 CONCLUSION

We propose DocPrompting, a simple and effective approach for code generation by retrieving the relevant documentation. DocPrompting consistently improves NL-to-code models on two tasks, in two PLs, and across multiple strong base models. DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset, tldr, DocPrompting improves CodeT5 and GPT-Neo-1.3B by up to 6.9% exact match, and Codex by 6.78 charBLEU points.
These results open a promising direction for NL code generation. We believe that our results can be further improved using more clever encoding of the structured nature of long documents, and using joint training of the retriever and the generator, which hopefully will avoid cascading errors. Further, we believe that the principles and the methods presented in this paper are applicable to additional code-related tasks, and other documentation-like resources such as tutorials and blog posts. To these ends, we make all our code, data, and models publicly available. Published as a conference paper at ICLR 2023 9 ACKNOWLEDGEMENT We thanks the anonymous reviewers for their useful comments and suggestions. This work is supported by a gift from Amazon AI and a contract from the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government. Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. Bimodal modelling of source code and natural language. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2123 2132. JMLR.org, 2015. URL http://proceedings.mlr. press/v37/allamanis15.html. Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of code. In International conference on machine learning, pages 245 256. PMLR, 2020. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. Ar Xiv preprint, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732. Nathana el Beau and Benoˆıt Crabb e. The impact of lexical and grammatical processing on generating code from natural language. Ar Xiv preprint, abs/2202.13972, 2022. URL https://arxiv.org/abs/2202. 13972. Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, 2021. URL https://doi.org/10.5281/zenodo. 5297715. If you use this software, please cite it using these metadata. S.R.K. Branavan, David Silver, and Regina Barzilay. Learning to win by reading manuals in a Monte-Carlo framework. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 268 277, Portland, Oregon, USA, 2011. Association for Computational Linguistics. URL https://aclanthology.org/P11-1028. Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, and Oleksandr Polozov. Generative code modeling with graphs. In International Conference on Learning Representations, 2019. URL https: //openreview.net/forum?id=Bke4Ks A5FX. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. Ar Xiv preprint, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 
A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1597 1607. PMLR, 2020. URL http://proceedings.mlr.press/v119/chen20j.html. Andrew Forward and Timothy C Lethbridge. The relevance of software documentation, tools and technologies: a survey. In Proceedings of the 2002 ACM symposium on Document engineering, pages 26 33, 2002. Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis. Ar Xiv preprint, abs/2204.05999, 2022. URL https://arxiv.org/abs/2204.05999. Tianyu Gao, Xingcheng Yao, and Danqi Chen. Sim CSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894 6910, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552. URL https://aclanthology.org/2021.emnlp-main. 552. Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, 2020. URL https://arxiv.org/abs/1908.10396. Published as a conference paper at ICLR 2023 Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. Ar Xiv preprint, abs/2002.08909, 2020. URL https://arxiv.org/abs/ 2002.08909. Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. A retrieve-and-edit framework for predicting structured outputs. Advances in Neural Information Processing Systems, 31, 2018. Shirley Anugrah Hayati, Raphael Olivier, Pravalika Avvaru, Pengcheng Yin, Anthony Tomasic, and Graham Neubig. Retrieval-based neural code generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 925 930, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1111. URL https://aclanthology.org/D18-1111. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. ar Xiv preprint ar Xiv:1904.09751, 2019. Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1643 1652, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1192. URL https://aclanthology.org/D18-1192. Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874 880, Online, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.74. URL https://aclanthology.org/2021.eacl-main.74. Jeff Johnson, Matthijs Douze, and Herv e J egou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535 547, 2019. Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. 
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769 6781, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550. Timothy C Lethbridge, Janice Singer, and Andrew Forward. How software engineers use documentation: The state of the practice. IEEE software, 20(6):35 39, 2003. Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K uttler, Mike Lewis, Wen-tau Yih, Tim Rockt aschel, Sebastian Riedel, and Douwe Kiela. Retrievalaugmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 6b493230205f780e1bc26945df7481e5-Abstract.html. Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. NL2Bash: A corpus and semantic parser for natural language interface to the linux operating system. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1491. Yanxin Lu, Swarat Chaudhuri, Chris Jermaine, and David Melski. Data-driven program completion. ar Xiv preprint ar Xiv:1705.09042, 2017. Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. A conversational paradigm for program synthesis. ar Xiv preprint, 2022. Janet Nykaza, Rhonda Messinger, Fran Boehme, Cherie L Norman, Matthew Mace, and Manuel Gordon. What programmers really want: results of a needs assessment for sdk documentation. In Proceedings of the 20th annual international conference on Computer documentation, pages 133 141, 2002. Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Retrieval augmented code generation and summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2719 2734, Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.232. URL https://aclanthology.org/2021. findings-emnlp.232. Published as a conference paper at ICLR 2023 Panupong Pasupat, Yuan Zhang, and Kelvin Guu. Controllable semantic parsing via retrieval augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7683 7698, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.607. URL https://aclanthology.org/2021.emnlp-main. 607. Maxim Rabinovich, Mitchell Stern, and Dan Klein. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1139 1149, Vancouver, Canada, 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1105. URL https://aclanthology.org/P17-1105. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Ar Xiv preprint, abs/1910.10683, 2019. 
URL https://arxiv.org/abs/1910.10683. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1 67, 2020. Stephen E Robertson and K Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information science, 27(3):129 146, 1976. Pedro Rodriguez and Jordan Boyd-Graber. Evaluation paradigms in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9630 9642, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021. emnlp-main.758. URL https://aclanthology.org/2021.emnlp-main.758. Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. How do professional developers comprehend software? In 2012 34th International Conference on Software Engineering (ICSE), pages 255 265. IEEE, 2012. Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, and Sida I. Wang. Natural language to code translation with execution, 2022. URL https://arxiv.org/abs/2204.11454. Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. Repository-level prompt generation for large language models of code. In ICML 2022 Workshop on Knowledge Retrieval and Language Models. Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. Code T5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8696 8708, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. URL https://aclanthology.org/2021.emnlp-main.685. Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. Execution-based evaluation for open-domain code generation. ar Xiv preprint ar Xiv:2212.10481, 2022. Yuhuai Wu, Markus Norman Rabe, De Lesley Hutchins, and Christian Szegedy. Memorizing transformers. In International Conference on Learning Representations, 2021. Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, and Graham Neubig. Incorporating external knowledge through pre-training for natural language to code generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6045 6052, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.538. URL https://aclanthology.org/ 2020.acl-main.538. Frank F Xu, Uri Alon, Graham Neubig, and Vincent J Hellendoorn. A systematic evaluation of large language models of code. Ar Xiv preprint, abs/2202.13169, 2022. URL https://arxiv.org/abs/2202. 13169. Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440 450, Vancouver, Canada, 2017. Association for Computational Linguistics. doi: 10. 18653/v1/P17-1041. URL https://aclanthology.org/P17-1041. Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. Learning to mine aligned code and natural language pairs from stack overflow. In 2018 IEEE/ACM 15th international conference on mining software repositories (MSR), pages 476 486. IEEE, 2018. Victor Zhong, Tim Rockt aschel, and Edward Grefenstette. 
RTFM: Generalising to novel environment dynamics via reading. ArXiv preprint, abs/1910.08210, 2019. URL https://arxiv.org/abs/1910.08210.

A TLDR: A NEWLY CURATED SHELL SCRIPTING BENCHMARK

NL-Bash pairs: For each command (e.g., cat), users contribute examples of pairs of NL descriptions and Bash code (mainly one-liners), including various flags and arguments, which cover the common usages of that command. An example is shown in Figure 2. We crawl NL-code pairs from the markdown files⁵ in the linux and common folders. We discard Bash commands whose manual is unavailable (discussed below). The detailed statistics are shown in Table 6. On average, each command has 4.84 NL-Bash pairs, and there is a total of 9,187 NL-code pairs. To test the generalizability of a model, we construct the training, development, and test sets with completely different commands.

⁵ e.g., https://github.com/tldr-pages/tldr/blob/main/pages/linux/toilet.md

Table 6: The statistics of the tldr shell scripting benchmark.

| | # Commands | # NL-Bash pairs |
|---|---|---|
| train | 1315 | 6414 |
| dev | 376 | 1845 |
| test | 188 | 928 |
| total | 1879 | 9187 |

Documentation pool D: We take the Bash manuals of the 1,879 Bash commands in tldr to construct a documentation pool. We search for each command name at manned.org⁶, a website that archives Unix manual pages (the same content as the Unix man command), and then extract the text contents from the returned manual page. We further break each manual into multiple paragraphs by line breaks, so that each paragraph describes a single concept such as a command functionality or a flag usage. We make this decision due to the large volume of content each manual has, which is too long to fit within the length limit of a neural model and too noisy, distracting the model with irrelevant information. This results in 400k individual entries in the pool in total.

⁶ https://manned.org

Oracle manuals $D^*_n$: We find the ground-truth documentation for each (n, c) pair through command name and flag matching heuristics. For instance, given a code snippet "toilet input text -f font filename", we constrain our search to the documentation from the toilet manual page and select the documentation paragraphs that start with the -f flag as oracle paragraphs. Along with the first paragraph, which commonly summarizes a command, these paragraphs form $D^*_n$.

Evaluation metrics: We use four evaluation metrics to measure the quality of the generated code: (a) command name accuracy (CMD Acc) measures whether the command name (e.g., cat) is predicted correctly; (b) token-level F1 converts the reference code and the generated code to bags of words and measures the token-level precision, recall, and F1 overlap; (c) exact match (EM) measures the exact match between the reference and the generation; and (d) character-level BLEU (charBLEU; Lin et al., 2018; Shi et al., 2022). For token-level F1, exact match, and charBLEU, we disregard all user-specific variables in the references and the system outputs. For example, mycli -u [user] -h [host] [database] is converted into mycli -u $1 -h $2 $3. This is mainly because the variables are not instantiated in tldr and the style of the placeholders varies among contributors. For example, some contributors might write [user] as [username] or [your name]. Therefore, measuring the surface form of user-specific variable names is less meaningful.
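A sketch of the flag-matching heuristic described under "Oracle manuals" above. The actual rules used to build the released benchmark may be more involved; this version only keeps the leading summary paragraph plus paragraphs that start with a flag appearing in the code:

```python
from typing import List

def oracle_paragraphs(code: str, manual_paragraphs: List[str]) -> List[str]:
    """Select oracle paragraphs from a command's manual for one NL-Bash pair.

    e.g. for "toilet input text -f font filename", keep the first (summary)
    paragraph of the toilet manual plus every paragraph starting with "-f".
    """
    flags = [tok for tok in code.split() if tok.startswith("-")]
    oracle = manual_paragraphs[:1]  # the leading paragraph that summarizes the command
    for para in manual_paragraphs[1:]:
        if any(para.lstrip().startswith(flag) for flag in flags):
            oracle.append(para)
    return oracle
```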
B RE-SPLITTING CONALA

NL-Python pairs: We adapt the popular CoNaLa benchmark and re-split the dataset to test the generalization scenario. This re-split makes every example in the development and the test set have at least one Python function (e.g., plt.plot) that was not seen in the training data. There are 2,135, 201, and 543 examples in the training, development, and test sets, respectively. We follow the original work of Yin et al. (2018) and evaluate the system outputs with BLEU-4. Since we focus on the generalization setting, we additionally report unseen function accuracy, which measures the percentage of correctly predicted held-out functions that do not appear in the training set.

Human-annotated unit tests: Following Chen et al. (2021) and Austin et al. (2021), we conduct execution-based evaluation on CoNaLa to measure the functional correctness of the generated code. We randomly selected 100 examples from the test set and manually annotated unit tests for each example. For example, we wrote tests such as assert gen_code("abcds", 2) == 4 and assert gen_code("abde", 2) == -1 to verify whether the function gen_code could perform "find the index of sub-string s in string str starting from index 2". Each example was annotated by a single annotator. The annotation was done by two authors of the paper, who program with Python daily. On average, we annotate 2.03 unit tests for each example.

Documentation pool D: Our documentation pool contains 35,763 manuals. These functions are from all Python libraries that are available on DevDocs.⁷ These libraries contain the Python built-in library and popular libraries like numpy and pandas. The documentation on DevDocs is curated and further transformed and indexed to allow for quick searching of APIs. We then extract each API signature and the corresponding documentation in every library, remove any content in the documentation that is not text, and segment the documentation into multiple paragraphs based on the HTML tags. The documentation pool then contains pairs of an API signature and a single paragraph from the corresponding documentation. Although the documentation pool is not comprehensive enough to cover all Python libraries and functions, we find it has a high coverage rate on the CoNaLa dataset. This choice reflects the flexibility of our approach with respect to the characteristics of a target scenario.

⁷ https://devdocs.io

Oracle manuals $D^*_n$: To find the oracle documents $D^*_n$ for a given example (n, c), we first index the function names with their absolute paths (e.g., plot is indexed as matplotlib.pyplot.plot) with Elasticsearch. Then we query the search engine with a cleaned version of c in which variable names are removed. The top-5 functions after de-duplication are treated as the oracle manuals $D^*_n$.

Natural language and code associations during pretraining: Despite our efforts, it is possible that some of the held-out functions in the test set were seen in association with NL contexts (e.g., comments) during the pretraining of a retriever and a generator. Since the generators were initialized from the same checkpoint in both the baselines and the DocPrompting models, such a possible association is expected to help both models equally. In the retriever, such a possible association did not cause the retriever to see the exact NL intents together with the corresponding documentation, and thus the matching between NL and docs was not leaked. However, it is possible that semantically similar intents had been seen along with the code snippets of the held-out functions. Nevertheless, such co-occurrence is indirect and unsupervised.

C DENSE RETRIEVER TRAINING

We finetune the model for 10 epochs with a batch size of 512 and a learning rate of 1e-5. Since CodeT5 does not use a [CLS] token, we instead take the average of the last layer's hidden states as the text representation. For CoNaLa, we also use the first 100k mined examples provided as part of CoNaLa as the supervised corpus. For CoNaLa, we only apply a single search step, because each code snippet commonly contains more than one function. We also observed that using the first sentence, which normally summarizes the usage of a function, achieves better retrieval performance than alternatives such as using the first paragraph or simply truncating to the maximum token length. The training takes up to 15 hours on a single A6000 GPU.

D GENERATOR TRAINING

We train our single-source generators for 20 epochs with a learning rate of 4e-5. We train our FiD-based generators for 10,000 steps. The doc length is set to 200; any further content is truncated. We follow Izacard and Grave (2021) and set the learning rate to 5e-5 with 2,000 warmup steps and linear learning rate decay. The batch size is set to 8. The best model is selected based on the token-level F1 score on the development set for tldr and the BLEU score for CoNaLa. The training takes 8 hours on a single A6000 GPU.

E CODEX PROMPTS

For the baseline, we prompt Codex with three NL-code pairs and append the test query to the end. An example on tldr is shown at the top of Table 7. At the bottom, we show the DocPrompting prompt, in which documentation is provided as well.
In the oracle command name setting, we prepend the command name before each NL intent for the baseline prompt. For the DocPrompting prompt, we replace the potential docs with the docs retrieved from the oracle manual.

[Figure 5: The recall@k (%) and the corresponding BLEU score obtained by using the top-k docs on the CoNaLa dataset (using CodeT5), plotted against the number of retrieved docs.]

F ADDITIONAL ANALYSIS

Parameter efficiency: As shown in Table 1, under a given parameter budget, we find that DocPrompting benefits most from parallel encoding (FiD). For example, the parallel-encoding T5+DocPrompting (220M parameters) significantly outperforms the 125M-parameter joint-encoding GPT-Neo-125M+DocPrompting. Only scaling GPT-Neo+DocPrompting up to 1.3B parameters manages to match the 220M-parameter T5+DocPrompting. A possible explanation is that although the base GPT-Neo-1.3B (without DocPrompting) generally performs better than the base T5 (without DocPrompting), parallel encoding allows the model to utilize the retrieved documents better, since documents are encoded independently on the encoder side.

The impact of the number of documents: Figure 5 shows the recall@k and the BLEU score as a function of k, the number of retrieved documents. Increasing k consistently yields a higher recall; however, as more irrelevant documents are retrieved, the generator cannot effectively distinguish them from the relevant ones and the overall performance remains similar. For example, CodeT5 achieves the highest BLEU score using 5 <= k <= 10. In contrast, when the generator is provided with only the oracle docs, its BLEU score reaches 49.04 (Table 3). This suggests that both precision and recall of docs are important, and that the benefit of using larger values of k in open-domain QA (Izacard and Grave, 2021) does not necessarily hold in code generation.

Full n-gram overlap: Table 8 shows that using documentation significantly increases the n-gram overlap recall between the input and the output, in tldr and CoNaLa. Since we used BM25 to retrieve docs in tldr, the overlap between the NL and the retrieved docs is high by construction. In CoNaLa, the unigram overlap between the NL and the retrieved docs is high as well, but since we used a dense retriever, the general n-gram overlap does not have to be high for DocPrompting to work well.

Retrieval latency: Although retrieving docs results in additional test-time computation, the increase in latency is not prohibitive. First, encoding the input for the retrieval step costs a single forward pass through the retriever's encoder, which is significantly less expensive than generation (which requires multiple time steps of the decoder). All the documentation in the retrieval pool can be encoded in advance, and finding the top-k results can be performed quickly using libraries such as FAISS (Johnson et al., 2019) on the GPU or ScaNN (Guo et al., 2020) on the CPU. The cost of this top-k search is sub-linear in the size of the document pool. Second, the additional input to the generator results in increased memory consumption, but only a small increase in latency, since the tokens of a given input can be encoded in parallel. If this difference is crucial in practical settings, we can decrease the number of retrieved documents. Figure 5 shows that retrieving as few as five docs may be sufficient in many cases.
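Concretely, the pre-encode-then-search setup described above can look like the following FAISS sketch (assuming faiss-cpu or faiss-gpu is installed; the embedding dimension, pool size, and random placeholder vectors are ours, standing in for the retriever encoder's output):

```python
import faiss
import numpy as np

# Pre-computed embeddings of all documentation paragraphs, encoded once offline.
# Random placeholders here; in practice they come from the retriever encoder.
dim = 768
doc_vecs = np.random.rand(400_000, dim).astype("float32")
faiss.normalize_L2(doc_vecs)        # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)
index.add(doc_vecs)

def retrieve_top_k(query_vec: np.ndarray, k: int = 10):
    """One encoder forward pass produces query_vec; the index returns the top-k doc ids."""
    q = query_vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return ids[0], scores[0]
```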
G FULL PASS@k PLOTS

For the main execution-based pass@k results in Section 5.2 and Figure 3, we took the best temperature for every model and value of k. Here, we show the pass@k plots for all temperatures in Figure 6.

Baseline prompt (top of Table 7):
# get the label of a fat32 partition
fatlabel /dev/sda1
# END
# display information without including the login, jcpu and pcpu columns
w --short
# END
# sort a csv file by column 9
csvsort -c 9 data.csv
# END
# search for a package in your current sources

Doc Prompting prompt (bottom of Table 7):
Potential document 0: fatlabel will display or change the volume label or volume ID on the MSDOS filesystem located on DEVICE ...
# get the label of a fat32 partition
fatlabel /dev/sda1
# END
Potential document 0: w displays information about the users currently on the machine, and their processes. The header shows, in this order ...
Potential document 1: -s, --short Use the short format. Don't print the login time, JCPU or PCPU times.
# display information without including the login, jcpu and pcpu columns
w --short
# END
Potential document 0: Sort CSV files. Like the Unix sort command, but for tabular data
Potential document 1: usage: csvsort [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u 0,1,2,3] [-b] [-p ESCAPECHAR] ...
Potential document 2: optional arguments: -h, --help show this help message and exit -n, --names Display column names and indices from the input CSV and exit. -c COLUMNS ...
Potential document 3: csvsort -c 9 examples/realdata/FY09_EDU_Recipients_by_State.csv
Potential document 4: csvcut -c 1,9 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvsort -r -c 2 | head -n 5
# sort a csv file by column 9
csvsort -c 9 data.csv
# END
Potential document 1: ...
Potential document 2: ...
# search for a package in your current sources

Table 7: Top: baseline Codex prompt with three NL-code pairs and a test intent. Bottom: Doc Prompting prompt for Codex. In each in-context learning example, the oracle docs, the NL intent, and the corresponding bash command are provided. We use up to five oracle docs for these examples. For a test example, the top-5 paragraphs from the retriever are presented together with the NL intent. The contents of some documents are omitted ("...") to save space.

Table 8: n-gram overlap between different contents (%). Using documentation significantly increases the n-gram overlap recall between the input and the output, on tldr and Co Na La.

tldr                               n=1   n=2   n=3
NL ↔ Code                           12     0     0
(NL + retrieved docs) ↔ Code        24     2     0
NL ↔ Retrieved docs                 39     8     3

Co Na La                           n=1   n=2   n=3   n=4   n=5
NL ↔ Code                           30    14    11     9     7
(NL + retrieved docs) ↔ Code        91    52    28    16    11
NL ↔ Retrieved docs                 72    14     3     1     1

Figure 6: Pass@k on 100 examples of the test set with different temperatures. [Plots omitted: five panels, one per temperature in {0.2, 0.4, 0.6, 0.8, 1.0}; each panel shows pass@k for Code T5 and Code T5+Doc Prompting, with k ranging from 0 to 200 on the x-axis.]
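For reference, the pass@k values in these plots can be computed with the standard unbiased estimator of Chen et al. (2021); the sketch below is our own illustration (not the paper's evaluation code), where n is the number of samples drawn per problem and c is the number of those samples that pass the tests.

import numpy as np

def pass_at_k(n, c, k):
    # Probability that at least one of k samples, chosen from n generations
    # of which c are correct, is correct: 1 - C(n-c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

Each curve in Figure 6 then corresponds to averaging this quantity over problems at a fixed sampling temperature.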
Table 9: Results on tldr and Co Na La with code-davinci-002.

tldr                                 CMD Acc (%)   EM (%)   Token F1   charBLEU
3-shots
  Codex                                  39.01      14.55      44.89      33.93
  + Doc Prompting                        36.10      13.97      42.55      32.93
With the oracle command name
  Codex                                      -      20.22      59.22      38.14
  + Doc Prompting                            -      33.15      68.59      44.76

Co Na La                                  BLEU     Recall
  Codex                                  48.39      43.35
  + Doc Prompting                        47.21      44.70
  + Doc Prompting (oracle docs)          54.67      59.68

H EXPERIMENTS WITH code-davinci-002

The results with code-davinci-002 under the few-shot learning setting are shown in Table 9. In the non-oracle settings, Codex+Doc Prompting did not improve over the base Codex; one explanation might be that the datasets leaked into Codex's training corpus. For example, Co Na La was extracted from Stack Overflow, which is included in the large Common Crawl corpus (https://commoncrawl.org/the-data/) that was used to train GPT-3, and possibly Codex. Therefore, Codex might have memorized the target code and thus did not need the additional documentation. Although this data leakage might have affected code-davinci-001 as well, we suspect that the issue has worsened in the stronger 002 version. Regardless, we believe that the large capacity of Codex requires an equally strong retriever to improve over the base model. With an oracle retriever, Doc Prompting yields significant improvements on both datasets; thus, the non-oracle results could be further improved by a stronger non-oracle retriever.

Examples on tldr are shown in Table 10. In the top three cases, the baseline T5 could not generate the correct bash command, while T5+Doc Prompting retrieves the correct bash documentation and uses the correct arguments. In the bottom two cases, T5 predicts the bash command correctly as well; however, in the puppet case it misses the concrete usage for executing the code, and in the elixir case it hallucinates the argument flags --example and --arguments, which elixir does not support at all. In contrast, T5+Doc Prompting grounds its generation in the retrieved docs and therefore generates more accurate code. More examples on Co Na La are shown in Table 11.

Table 10: Examples of predictions from tldr, of the base T5 compared to T5+Doc Prompting. In tldr, the bash commands in the test set are completely different from those in the training set.

play a midi file in a specific key (0 = c major/a minor, -1 = f major/d minor, +1 = g major/e minor, etc.)
Ground truth: timidity --force-keysig=-flats|+sharps path/to/file.mid
T5: midiplay key
T5+Doc Prompting: timidity --force-keysig f|f|f path/to/file.midi

clean up all temporary tex files in the current directory
Ground truth: latexmk -c
T5: tex clean
T5+Doc Prompting: latexmk -c

specify the hash algorithm for the index file (experimental)
Ground truth: git show-index --object-format=sha1|sha256 path/to/file
T5: hash --experimental path/to/index.hashes
T5+Doc Prompting: git show-index --object-format= path/to/hash-algorithm path/to/experimental

execute puppet code
Ground truth: puppet apply --execute code
T5: puppet
T5+Doc Prompting: puppet apply --execute path/to/file.puppet

evaluate elixir code by passing it as an argument
Ground truth: elixir -e code
T5: elixir --example --arguments path/to/file.elixir
T5+Doc Prompting: elixir -e path/to/file.elixir

Table 11: Examples of predictions from Co Na La, of the base Code T5 compared to Code T5+Doc Prompting. Unseen functions are underlined.
set the current working directory to c:\Users\uname\desktop\python
Ground truth: os.chdir('c:\Users\uname\desktop\python')
Code T5: os.system('c:\Users\uname\desktop\python')
Code T5+Doc Prompting: os.chdir('c:\Users\uname\desktop\python')

convert dataframe df to integer-type sparse object
Ground truth: df.to_sparse(0)
Code T5: np.isinstance(df, np.integer)
Code T5+Doc Prompting: df.to_sparse('i')