Given a sequence of source code, the task is to predict the correct type for a particular variable, parameter, or function. Type prediction is an important task for software development, especially in dynamically typed environments, which can benefit from stronger type checking while keeping the advantages of dynamic typing. We treat the task as a sequence tagging task, similar to Named-Entity Recognition (NER) in NLP.
The dataset originates from the MSR '22 paper ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference. The dataset is available for download on Zenodo.
- Download the dataset, either with wget as shown below or by clicking the link. Unzip it and run the preprocessing script with your chosen parameters, such as the type vocabulary size. For the default type vocabulary size of 50k types, the train#.jsonl, test.jsonl, and valid.jsonl files are already available in the main directory.
cd CodeXGLUE/Code-Code/TypePrediction-TypeScript/dataset
wget "https://zenodo.org/record/6387001/files/ManyTypes4TypeScript.tar.gz?download=1" -O ManyTypes4TypeScript.tar.gz
tar -xvzf ManyTypes4TypeScript.tar.gz
- Install required packages. Use
pip install -r requirements.txt
- Preprocess dataset. Use the process_datafiles.py in the dataset folder.
python process_datafiles.py -v <vocab-size>
A vocabulary of 50,000 was used in the ManyTypes4TypeScript paper.
After preprocessing the dataset, there should be five jsonl files: train0.jsonl, train1.jsonl, train2.jsonl, test.jsonl, and valid.jsonl. The training set is split into several files to accommodate Git LFS.
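Because the training set is spread over several files, it can be convenient to read the shards as one stream. The following is a minimal sketch, not part of the repository: `train_shards` and `iter_records` are hypothetical helper names, and the only assumption is that each non-empty line of a jsonl file holds one JSON object.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List


def train_shards(dataset_dir: str) -> List[Path]:
    """Return train0.jsonl, train1.jsonl, ... sorted by shard number."""
    prefix = "train"
    shards = [
        path
        for path in Path(dataset_dir).glob("train*.jsonl")
        # Keep only files whose name is "train" followed by digits.
        if path.stem[len(prefix):].isdigit()
    ]
    return sorted(shards, key=lambda path: int(path.stem[len(prefix):]))


def iter_records(paths: List[Path]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line across the given files."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

Reading lazily with a generator avoids holding all training files in memory at once; `iter_records(train_shards("."))` would then walk the shards in order.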
Each line represents a file. The data fields are the same among all splits.
| field name | type | description |
|---|---|---|
| tokens | list[string] | Sequence of tokens (word tokenization) |
| labels | list[string] | A list of corresponding types |
| url | string | Repository URL |
| path | string | Original file path that contains this code |
| commit_hash | string | Commit identifier in the original project |
| file | string | File name |

The splits have the following sizes:

| name | train | test | validation |
|---|---|---|---|
| projects | 11,413 (81.8%) | 1,336 (9.58%) | 1,204 (8.62%) |
| files | 486,477 (90.16%) | 28,045 (5.20%) | 25,049 (4.64%) |
| sequences | 1,727,927 (91.95%) | 81,627 (4.34%) | 69,652 (3.71%) |
| types | 8,696,679 (95.33%) | 224,415 (2.46%) | 201,428 (2.21%) |
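To make the `tokens` and `labels` fields concrete, here is a small sketch that pairs each token with its type and counts how often each type occurs. It assumes the two lists are aligned one-to-one, as the field descriptions suggest. The marker used for tokens without a type (`None` by default here) is an assumption and should be checked against the actual files; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def labeled_positions(
    record: Dict, unlabeled: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return (token, type) pairs, skipping positions that carry the unlabeled marker."""
    tokens, labels = record["tokens"], record["labels"]
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [(tok, lab) for tok, lab in zip(tokens, labels) if lab != unlabeled]


def type_counts(records: Iterable[Dict], unlabeled: Optional[str] = None) -> Counter:
    """Count how often each type appears across all records."""
    counts: Counter = Counter()
    for record in records:
        counts.update(lab for _, lab in labeled_positions(record, unlabeled))
    return counts
```

Calling `type_counts(records).most_common(n)` then gives the `n` most frequent types, which is one way to inspect what a type vocabulary of a given size would cover.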
We provide a script that evaluates predictions for this task and reports the accuracy score. It can be found at evaluator/evaluator.py. The train.py in the code directory also has a built-in evaluator and reports scores before generating a list of predictions.
python evaluator/evaluator.py -a evaluator/gold_labels.txt -p evaluator/example-codebert-predictions.txt
{'Acc': 0.6280417975625515}
The evaluator expects a prediction file with one prediction per line in the form
index type
See example-codebert-predictions.txt.
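The accuracy reported by the evaluator can be understood through the sketch below. This is not the code of evaluator/evaluator.py; it is an illustration under two assumptions: the index and the type on each line are separated by whitespace, and a gold index with no matching prediction counts as incorrect. The function names are hypothetical.

```python
from typing import Dict, Iterable


def parse_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each index to its type; assumes 'index type' separated by whitespace."""
    entries: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'index type', got: {line!r}")
        index, type_name = parts
        entries[index] = type_name
    return entries


def accuracy(gold: Dict[str, str], pred: Dict[str, str]) -> float:
    """Fraction of gold entries whose predicted type matches exactly."""
    if not gold:
        raise ValueError("gold labels are empty")
    # Assumption: an index missing from the predictions is scored as wrong.
    correct = sum(1 for index, type_name in gold.items() if pred.get(index) == type_name)
    return correct / len(gold)
```

With files on disk, the two dictionaries could be built by passing open file handles to `parse_lines`.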
We also provide a pipeline that finetunes CodeBERT on this task; see train.py. For models with torch weights, the saved state_dict can be read in the usual way with torch.load() and applied to the model with load_state_dict().
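The usual PyTorch pattern for saving and restoring weights looks like the sketch below. A tiny `nn.Linear` stands in for the finetuned model, and the file name `model.bin` is a placeholder; the actual model class and checkpoint path depend on how train.py was run.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in model; in practice this would be the finetuned type-prediction model.
model = nn.Linear(4, 2)

# Save only the parameters (the state_dict), not the whole module object.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "model.bin")  # placeholder path
torch.save(model.state_dict(), checkpoint_path)

# To restore, build a model with the same architecture and load the weights into it.
restored = nn.Linear(4, 2)
state_dict = torch.load(checkpoint_path, map_location="cpu")
restored.load_state_dict(state_dict)
restored.eval()  # switch to inference mode before predicting
```

Passing `map_location="cpu"` lets a checkpoint written on a GPU machine be loaded on a machine without one.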
The train.py file trains and evaluates by default. The validation dataset used for evaluation can be replaced with another split. The pipeline is not limited to CodeBERT; it can be extended to any pretrained Hugging Face encoder model.
To train and evaluate:
CUDA_VISIBLE_DEVICES=0 accelerate launch train.py --output_dir type-model --train_batch_size=36 --eval_batch_size=16
To evaluate only:
CUDA_VISIBLE_DEVICES=0 accelerate launch train.py --output_dir type-model --do_train=False --eval_batch_size=16
To score a prediction file:
python evaluator/evaluator.py -a evaluator/gold_labels.txt -p evaluator/example-codebert-predictions.txt
{'Acc': 0.6280417975625515}
The results from the paper are listed below. Top 100 indicates the performance across the 100 most frequently occurring types.
| Model | Top 100 Precision | Top 100 Recall | Top 100 F1 | Top 100 Accuracy | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy |
|---|---|---|---|---|---|---|---|---|
| CodeBERT | 84.58 | 85.98 | 85.27 | 87.94 | 59.34 | 59.80 | 59.57 | 61.72 |
| GraphCodeBERT | 84.67 | 86.41 | 85.53 | 88.08 | 60.06 | 61.08 | 60.57 | 62.51 |
| CodeBERTa | 81.31 | 82.72 | 82.01 | 85.94 | 56.57 | 56.85 | 56.71 | 59.81 |
| PolyGot | 84.45 | 85.45 | 84.95 | 87.72 | 58.81 | 58.91 | 58.86 | 61.29 |
| GraphPolyGot | 83.80 | 85.23 | 84.51 | 87.40 | 58.36 | 58.91 | 58.63 | 61.00 |
| RoBERTa | 82.03 | 83.81 | 82.91 | 86.25 | 57.45 | 57.62 | 57.54 | 59.84 |
| BERT | 80.04 | 81.50 | 80.76 | 84.97 | 54.18 | 54.02 | 54.10 | 57.52 |
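As an illustration of what a Top 100 score measures, the sketch below computes accuracy only over positions whose gold type is among the `k` most frequent types. This is an assumed reading, not the paper's evaluation code: where the type frequencies come from (here a separate `reference_types` list, for example the training labels) and the choice to filter by gold type are both assumptions, and the function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def top_k_accuracy(
    gold: Sequence[str],
    pred: Sequence[str],
    reference_types: Sequence[str],
    k: int = 100,
) -> float:
    """Accuracy restricted to positions whose gold type is one of the k most frequent types."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Assumption: frequency is measured on reference_types, e.g. the training labels.
    top_types = {type_name for type_name, _ in Counter(reference_types).most_common(k)}
    pairs = [(g, p) for g, p in zip(gold, pred) if g in top_types]
    if not pairs:
        raise ValueError("no gold label falls within the top-k types")
    return sum(g == p for g, p in pairs) / len(pairs)
```

Setting `k` to at least the number of distinct types reduces this to plain accuracy over every position whose gold type appears in `reference_types`.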
Please cite the original paper.
@article{jesse2022manytypes4typescript,
title={ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference},
author={Jesse, Kevin and Devanbu, Premkumar T},
year={2022}
}