FunCom

A Java function and comment parallel corpus

Introduction

FunCom is a collection of ~2.1 million Java methods and their associated Javadoc comments. The data set was derived from a set of 51 million Java methods. It includes only methods that have an associated comment written in English, and auto-generated files have been removed. Each method/comment pair also has an associated method_uid and project_uid, so it is easy to group methods by their parent project.
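
As a rough illustration of grouping by parent project, the sketch below collects method/comment pairs under their project identifier. The function name group_by_project and the four-field tuple layout (project id, method id, method, comment) are assumptions made for this example, not part of the released files.

```python
from collections import defaultdict


def group_by_project(pairs):
    """Group (project_id, method_id, method, comment) tuples by project.

    The tuple layout is an assumption for illustration only; adapt it to
    however you load the data set.
    """
    groups = defaultdict(list)
    for project_id, method_id, method, comment in pairs:
        groups[project_id].append((method_id, method, comment))
    return dict(groups)


# Two hypothetical rows that share a project and one that does not.
rows = [
    (10536, 1, "public void close() {}", "closes the reader"),
    (10536, 2, "public void open() {}", "opens the reader"),
    (52274, 3, "public void render() {}", "renders the message"),
]
by_project = group_by_project(rows)
```

Keeping every method of a project together in this way makes it straightforward to assign whole projects, rather than individual methods, to a given subset of the data.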

Contact

Alexander LeClair - Website
Collin McMillan - Website

Cite this work

LeClair, A., McMillan, C., "Recommendations for Datasets for Source Code Summarization", in Proc. of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'19), Short Research Paper Track, Minneapolis, USA, June 2-7, 2019.

Data

There are three versions of this data set available for download.

  • 51 million Java method and comment data set as an SQL database dump plus a download of the source files from the Sourcerer data set
    994 MB
  • 2.1 million Java methods and comments with unprocessed source code and unprocessed comments
    183 MB
  • 2.1 million Java methods and comments, preprocessed. Source code has special characters removed and camel case split, and is lowercased. Comments are the first line of the Javadoc, lowercased, with special characters removed
    201 MB

Examples

The examples below are taken from two of the data sets to highlight the differences between the filtered and tokenized sets. The first two examples come from the filtered data set, where the method and comment strings are unprocessed, while the last two are the tokenized versions of the same method/comment pairs.

project_id function_id function comment
10536 9245436 ' public void close() throws IOException {\n input.close();\n }\n' ' /** By default, closes the input Reader. */\n'
52274 50900999 '\tpublic void render(GameData data) {\n\t\tsetText(Message.render(data, type.getPattern(), attributes));\n\t}\n' '\t/**\n\t * Renders the message and updates the message text.\n\t *\n\t * @param data The GameData for replacing unit IDs and region coordinates\n\t */\n'
10536 9245436 'public void close throws ioexception input close' 'by default closes the input reader'
52274 50900999 'public void render game data data set text message render data type get pattern attributes' 'renders the message and updates the message text'
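
The difference between the unprocessed and tokenized rows can be approximated with a few regular expressions. The sketch below is not the script used to build the data set. The function names and the exact rules are assumptions: every non-alphanumeric character is treated as a special character, camel case is split only at a lowercase-to-uppercase boundary, and the comment is taken from the first line that contains any text.

```python
import re


def preprocess_code(source: str) -> str:
    """Approximate the tokenized form of a Java method."""
    # Assumption: every run of non-alphanumeric characters becomes one space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", source)
    # Split camel case at lowercase-to-uppercase boundaries (setText -> set Text).
    # Runs of capitals such as IOException are left as a single token.
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return " ".join(text.lower().split())


def preprocess_comment(javadoc: str) -> str:
    """Approximate the tokenized form of a Javadoc comment."""
    for line in javadoc.splitlines():
        # Dropping special characters also drops the /** and * markers.
        words = re.sub(r"[^A-Za-z0-9]+", " ", line).split()
        if words:
            # Assumption: the first line with any text is the summary line.
            return " ".join(words).lower()
    return ""


raw_method = " public void close() throws IOException {\n input.close();\n }\n"
raw_comment = " /** By default, closes the input Reader. */\n"

print(preprocess_code(raw_method))     # public void close throws ioexception input close
print(preprocess_comment(raw_comment)) # by default closes the input reader
```

Under these assumptions the sketch reproduces both tokenized rows shown above. It may differ from the released data on inputs not covered by these examples, such as identifiers that contain underscores or digits.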

Filtered data set - download

2.1 million method/comment pairs. Only pairs with English-language comments are kept, and auto-generated source code and comments have been removed. The data set has only been filtered; no preprocessing has been done on the method/comment strings.

Tokenized data set - download

Similar to the filtered data set, but already preprocessed and tokenized so that downstream tasks are easy to prototype.

Raw data set - download

The SQL dump of the data set includes a total of 12 tables. To fully reconstruct the raw data set, you must also download the Sourcerer Java Data Set and use the line begin and line end attributes stored in the database to extract the raw methods directly from the source files. Be aware that the file paths recorded in the database will differ slightly from the paths on your machine.
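
A minimal sketch of that reconstruction step is shown below. It assumes the line begin and line end values are 1-indexed and inclusive, and that the path mismatch can be fixed by swapping a leading prefix. Neither assumption is confirmed here, so check both against the database. The function names, parameter names, and example paths are hypothetical.

```python
from pathlib import Path


def remap_path(db_path: str, db_prefix: str, local_root: str) -> Path:
    """Replace the path prefix stored in the database with a local root.

    db_prefix and local_root are hypothetical values you would set after
    comparing a database path with your own Sourcerer download.
    """
    if not db_path.startswith(db_prefix):
        raise ValueError(f"{db_path!r} does not start with {db_prefix!r}")
    relative = db_path[len(db_prefix):].lstrip("/")
    return Path(local_root) / relative


def extract_method(file_path, line_begin: int, line_end: int) -> str:
    """Return the lines line_begin..line_end of a source file.

    Assumption: both bounds are 1-indexed and inclusive.
    """
    with open(file_path, encoding="utf-8", errors="replace") as handle:
        lines = handle.readlines()
    return "".join(lines[line_begin - 1:line_end])


# Hypothetical usage:
# local_file = remap_path("/db/root/project/Foo.java", "/db/root", "/data/sourcerer")
# method_text = extract_method(local_file, 12, 18)
```

If the extracted text starts or ends one line away from the method boundary, the line attributes are probably 0-indexed or exclusive, and the slice should be adjusted accordingly.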