CodeXGLUE -- Type Prediction -- TypeScript

Task Definition

Given a sequence of source code, the task is to predict the correct type for a particular variable, parameter, or function. Type prediction is an important task for software development, especially in dynamically typed environments, which can benefit from stronger type checking while keeping the advantages of dynamic typing. We treat the task as a sequence tagging task, similar to Named-Entity Recognition (NER) in NLP.

Dataset

The dataset originates from the MSR '22 paper ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference. The dataset is available for download on Zenodo.

Download and Preprocess

Download the dataset either with wget as shown below or by clicking on the link. Unzip it and run the preprocess script with your chosen parameters, such as the type vocabulary size. For the default type vocabulary size of 50k types, train#.jsonl, test.jsonl, and valid.jsonl are already available in the main directory.

cd CodeXGLUE/Code-Code/TypePrediction-TypeScript/dataset
wget "https://zenodo.org/record/6387001/files/ManyTypes4TypeScript.tar.gz?download=1" -O ManyTypes4TypeScript.tar.gz
tar -xvzf ManyTypes4TypeScript.tar.gz

Install the required packages:

pip install -r requirements.txt

Preprocess the dataset with process_datafiles.py in the dataset folder:

python process_datafiles.py -v <vocab-size>

A vocabulary of 50,000 types was used in the ManyTypes4TypeScript paper.
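To picture what a type vocabulary size means, here is a minimal sketch. It is not the logic of process_datafiles.py; it assumes the vocabulary keeps the most frequent types and maps every other type to a placeholder. The names `build_type_vocab`, `apply_vocab`, and `UNK`, and the use of `None` for untyped positions, are hypothetical.

```python
from collections import Counter

UNK = "UNK"  # hypothetical placeholder for types outside the vocabulary


def build_type_vocab(label_sequences, vocab_size):
    """Return the set of the `vocab_size` most frequent types."""
    counts = Counter(
        label
        for labels in label_sequences
        for label in labels
        if label is not None  # assumption: None marks an untyped position
    )
    return {label for label, _ in counts.most_common(vocab_size)}


def apply_vocab(labels, vocab):
    """Replace types outside the vocabulary with UNK; keep untyped positions."""
    return [l if l is None or l in vocab else UNK for l in labels]


# Toy label sequences, not taken from the dataset.
sequences = [
    ["string", None, "number"],
    ["string", "MyWidget", None],
    ["number", "string", None],
]
vocab = build_type_vocab(sequences, vocab_size=2)
print(sorted(vocab))                     # ['number', 'string']
print(apply_vocab(sequences[1], vocab))  # ['string', 'UNK', None]
```

With a vocabulary of two types, the rare type `MyWidget` falls outside the vocabulary, which is the trade-off the `-v` parameter controls.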

Data Format

After preprocessing the dataset there should be five jsonl files: train0.jsonl, train1.jsonl, train2.jsonl, test.jsonl, and valid.jsonl. The training set is split across several files to accommodate Git LFS.

Each line represents a file. The data fields are the same among all splits.

field name	type	description
tokens	list[string]	Sequence of tokens (word tokenization)
labels	list[string]	A list of corresponding types
url	string	Repository URL
path	string	Original file path that contains this code
commit_hash	string	Commit identifier in the original project
file	string	File name
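The fields above can be read with the standard json module. The sketch below parses one line and pairs each annotated token with its type. The record is a made-up example, and treating unannotated tokens as `null` labels is an assumption; check a real line from train0.jsonl before relying on it. `typed_tokens` is a hypothetical helper name.

```python
import json

# Hypothetical record following the field table; real records are much longer.
sample_line = json.dumps({
    "tokens": ["function", "add", "(", "a", ",", "b", ")", "{", "}"],
    "labels": [None, "number", None, "number", None, "number", None, None, None],
    "url": "https://example.com/owner/repo",
    "path": "src/add.ts",
    "commit_hash": "0000000",
    "file": "add.ts",
})


def typed_tokens(line):
    """Parse one jsonl line and return (token, type) pairs for annotated tokens."""
    record = json.loads(line)
    if len(record["tokens"]) != len(record["labels"]):
        raise ValueError("tokens and labels must be aligned")
    return [
        (token, label)
        for token, label in zip(record["tokens"], record["labels"])
        if label is not None  # assumption: null means "no type annotation"
    ]


print(typed_tokens(sample_line))
# [('add', 'number'), ('a', 'number'), ('b', 'number')]
```

For a real split, the same function can be applied to each line of an opened file, for example `for line in open("valid.jsonl")`.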

Data Splits

name	train	test	validation
projects	11,413 (81.8%)	1,336 (9.58%)	1,204 (8.62%)
files	486,477 (90.16%)	28,045 (5.20%)	25,049 (4.64%)
sequences	1,727,927 (91.95%)	81,627 (4.34%)	69,652 (3.71%)
types	8,696,679 (95.33%)	224,415 (2.46%)	201,428 (2.21%)
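Each percentage is the split's share of the row total. As a quick check, the following sketch recomputes the shares for the files row from the counts in the table.

```python
# Counts copied from the "files" row of the table above.
files = {"train": 486_477, "test": 28_045, "validation": 25_049}

total = sum(files.values())
shares = {name: round(100 * count / total, 2) for name, count in files.items()}

print(total)   # 539571
print(shares)  # {'train': 90.16, 'test': 5.2, 'validation': 4.64}
```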

Evaluator

We provide a script that evaluates predictions for this task and reports the accuracy score. It can be found at evaluator/evaluator.py. The train.py in the code directory also has a built-in evaluator and reports scores before generating a list of predictions.

Example

python evaluator/evaluator.py -a evaluator/gold_labels.txt -p evaluator/example-codebert-predictions.txt

{'Acc': 0.6280417975625515}

Input predictions

The evaluator takes a prediction file whose predictions have the form

index	type

See example-codebert-predictions.txt.
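The sketch below illustrates how accuracy over such index/type pairs can be computed. It is not the code of evaluator/evaluator.py: it assumes one tab-separated `index`, `type` pair per line and a gold file in the same layout, and the helper names `read_labels` and `accuracy` are hypothetical.

```python
import io


def read_labels(handle):
    """Read `index<TAB>type` lines into a dict keyed by index."""
    labels = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        index, type_name = line.split("\t", 1)
        labels[index] = type_name
    return labels


def accuracy(gold, predictions):
    """Fraction of gold indices whose predicted type matches exactly."""
    correct = sum(1 for idx, t in gold.items() if predictions.get(idx) == t)
    return correct / len(gold)


# In-memory stand-ins for the gold and prediction files (made-up contents).
gold = read_labels(io.StringIO("0\tstring\n1\tnumber\n2\tboolean\n3\tstring\n"))
pred = read_labels(io.StringIO("0\tstring\n1\tstring\n2\tboolean\n3\tstring\n"))
print({"Acc": accuracy(gold, pred)})  # {'Acc': 0.75}
```

A missing index in the prediction file counts as an error in this sketch, because `predictions.get(idx)` returns `None` for it.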

Pipeline-CodeBERT

We also provide a pipeline that finetunes CodeBERT on this task; see train.py. For models with torch weights, the model state_dict can be loaded in the usual way with torch.load().

Train (Finetune)

The train.py file trains and evaluates by default. The validation dataset used in the evaluation can be replaced with another one. We provide a pipeline for finetuning Hugging Face models like CodeBERT; it can be extended to any pretrained Hugging Face encoder model.

CUDA_VISIBLE_DEVICES=0 accelerate launch train.py --output_dir type-model --train_batch_size=36 --eval_batch_size=16

Inference

CUDA_VISIBLE_DEVICES=0 accelerate launch train.py --output_dir type-model --do_train=False --eval_batch_size=16

Evaluation

python evaluator/evaluator.py -a evaluator/gold_labels.txt -p evaluator/example-codebert-predictions.txt

{'Acc': 0.6280417975625515}

Results

The results from the paper are listed below. Top 100 indicates the performance across the 100 most frequently occurring types.

Model	Top 100				Overall
	Precision	Recall	F1	Accuracy	Precision	Recall	F1	Accuracy
CodeBERT	84.58	85.98	85.27	87.94	59.34	59.80	59.57	61.72
GraphCodeBERT	84.67	86.41	85.53	88.08	60.06	61.08	60.57	62.51
CodeBERTa	81.31	82.72	82.01	85.94	56.57	56.85	56.71	59.81
PolyGot	84.45	85.45	84.95	87.72	58.81	58.91	58.86	61.29
GraphPolyGot	83.80	85.23	84.51	87.40	58.36	58.91	58.63	61.00
RoBERTa	82.03	83.81	82.91	86.25	57.45	57.62	57.54	59.84
BERT	80.04	81.50	80.76	84.97	54.18	54.02	54.10	57.52
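The F1 columns are consistent, up to rounding, with the harmonic mean of the precision and recall columns. The sketch below checks this for the first two models; it says nothing about how the paper aggregated precision and recall across types.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "CodeBERT top 100": (84.58, 85.98, 85.27),
    "CodeBERT overall": (59.34, 59.80, 59.57),
    "GraphCodeBERT top 100": (84.67, 86.41, 85.53),
    "GraphCodeBERT overall": (60.06, 61.08, 60.57),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1(p, r):.2f}, reported {reported:.2f}")
```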

Reference

Please cite the original paper.


@article{jesse2022manytypes4typescript,
  title={ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference},
  author={Jesse, Kevin and Devanbu, Premkumar T},
  year={2022}
}
