A Java function and comment parallel corpus


Funcom is a collection of ~2.1 million Java methods and their associated Javadoc comments. The data set was derived from a set of 51 million Java methods. It keeps only methods that have an associated comment written in English, and auto-generated files have been removed. Each method/comment pair also has an associated method_uid and project_uid, so it is easy to group methods by their parent project.
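
As a minimal sketch of the grouping that the project identifier makes possible, the snippet below collects method IDs under their parent project. It assumes the pairs have already been loaded as (project ID, method ID) tuples; the function name `group_by_project` is hypothetical and not part of the data set, and the third tuple is a made-up placeholder added only so that one project contains more than one method.

```python
from collections import defaultdict


def group_by_project(pairs):
    """Map each project ID to the list of method IDs that belong to it."""
    groups = defaultdict(list)
    for project_id, method_id in pairs:
        groups[project_id].append(method_id)
    return dict(groups)


# (project ID, method ID) tuples. The first two IDs come from the examples
# on this page; the third is a hypothetical placeholder, not real data.
pairs = [
    (10536, 9245436),
    (52274, 50900999),
    (10536, 0),  # hypothetical record
]

groups = group_by_project(pairs)
for project_id, method_ids in groups.items():
    print(project_id, method_ids)
```

Grouping this way is useful, for instance, when all methods of a project should stay on the same side of a train/test split.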


Alexander LeClair - Website
Collin McMillan - Website

Cite this work

LeClair, A., McMillan, C., "Recommendations for Datasets for Source Code Summarization", in Proc. of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'19), Short Research Paper Track, Minneapolis, USA, June 2-7, 2019.


There are three versions of this data set available for download.

  • 51 million Java method and comment data set as an SQL database dump plus a download of the source files from the Sourcerer data set
    994 MB
  • 2.1 million Java methods and comments with unprocessed source code and unprocessed comments
    183 MB
  • 2.1 million Java methods and comments, preprocessed. Source code has special characters removed and camel case split, and is lowercased. Comments are the first line of the Javadoc, lowercased, with special characters removed
    201 MB


The examples below are taken from both data sets to highlight the differences between the unprocessed and tokenized sets. The first two examples show the unprocessed method/comment strings, while the last two are the tokenized versions of the same method/comment pairs.

project_id function_id function comment
10536 9245436 ' public void close() throws IOException {\n input.close();\n }\n' ' /** By default, closes the input Reader. */\n'
52274 50900999 '\tpublic void render(GameData data) {\n\t\tsetText(Message.render(data, type.getPattern(), attributes));\n\t}\n' '\t/**\n\t * Renders the message and updates the message text.\n\t *\n\t * @param data The GameData for replacing unit IDs and region coordinates\n\t */\n'
10536 9245436 'public void close throws ioexception input close' 'by default closes the input reader'
52274 50900999 'public void render game data data set text message render data type get pattern attributes' 'renders the message and updates the message text'
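
The step from the unprocessed rows to the tokenized rows above can be sketched roughly as follows. This is not the script used to build the data set. The function names and regular expressions are assumptions chosen to reproduce the two examples: camel case is split only at a lowercase-to-uppercase boundary (so `IOException` stays one token), digits are kept, and the "first line" of a comment is taken to be the first line that still contains text once special characters are removed.

```python
import re


def preprocess_code(source):
    """Split camel case, drop special characters, and lowercase a method."""
    # Insert a space at every lowercase-to-uppercase boundary (assumption).
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", source)
    # Replace every run of non-alphanumeric characters with a single space.
    words = re.sub(r"[^A-Za-z0-9]+", " ", spaced).split()
    return " ".join(words).lower()


def preprocess_comment(javadoc):
    """Return the first non-empty Javadoc line, cleaned and lowercased."""
    for line in javadoc.splitlines():
        words = re.sub(r"[^A-Za-z0-9]+", " ", line).split()
        if words:
            return " ".join(words).lower()
    return ""


code = "\tpublic void render(GameData data) {\n\t\tsetText(Message.render(data, type.getPattern(), attributes));\n\t}\n"
comment = "\t/**\n\t * Renders the message and updates the message text.\n\t *\n\t * @param data The GameData for replacing unit IDs and region coordinates\n\t */\n"

print(preprocess_code(code))
print(preprocess_comment(comment))
```

Applied to the two unprocessed examples, this sketch yields the same strings as the tokenized rows; other methods in the data set may have been handled differently.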

Filtered data set - download

2.1 million method/comment pairs. Only English-language comments are kept, and auto-generated source code and comments have been removed. No preprocessing has been done on the method/comment strings; the data set has only been filtered.

Tokenized data set - download

Similar to the filtered data set, but already preprocessed and tokenized so that downstream tasks are easy to prototype.

Raw data set - The files used to create our data set were part of the UCI Sourcerer project and are no longer available for download. Please feel free to send me an email if you have any questions.