Does BLEU Score Work for Code Migration?

Statistical machine translation (SMT) and its automatic evaluation metric, BLEU, have been widely applied to a Software Engineering task named code migration. (In)validating the use of BLEU score could advance the research and development of SMT-based code migration tools. Unfortunately, no study has yet validated or invalidated the use of BLEU score for source code.
In this work, we conducted an empirical study on BLEU score to invalidate its suitability for the code migration task, because it cannot reflect the semantics of source code. For our study, we used human judgment as the ground truth to measure the semantic correctness of the migrated code. We demonstrated that BLEU score has a weak correlation with the semantic correctness of translated code, and we provided counter-examples to show that BLEU is ineffective in comparing the translation quality between SMT-based models. Due to BLEU's ineffectiveness for the code migration task, we propose an alternative metric, RUBY, which considers lexical, syntactical, and semantic representations of source code. We verified that RUBY achieves a strong correlation with the semantic correctness of migrated code, and that RUBY is effective in reflecting the changes in translation quality of SMT-based translation models.
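To illustrate why an n-gram overlap metric can diverge from semantic correctness, the following is a minimal sketch of a simplified, unsmoothed sentence-level BLEU over whitespace-separated code tokens. It is not the evaluation setup used in the study: the function name `simple_bleu`, the tokenization, and the three code snippets are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU (hypothetical helper, no smoothing).

    Uses clipped n-gram precision for n = 1..max_n, a geometric mean,
    and the standard brevity penalty. Returns 0.0 if any precision is zero.
    """
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Illustrative snippets (assumed, not taken from the study's data).
reference = "if ( x > 0 ) return x ; else return - x ;"
# One flipped operator: lexically close, but the behavior is wrong.
wrong = "if ( x < 0 ) return x ; else return - x ;"
# Same behavior as the reference, written differently.
equivalent = "return x > 0 ? x : - x ;"

score_wrong = simple_bleu(reference, wrong)
score_equivalent = simple_bleu(reference, equivalent)
print(score_wrong > score_equivalent)
```

In this sketch the semantically incorrect candidate scores higher than the functionally equivalent one, because the score depends only on shared token sequences. This is a single hand-picked case and says nothing about how often such disagreements occur in practice.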

Hypothesis: BLEU score does not measure well the quality of translated results, where quality is estimated from the similarity in terms of semantics/functionality between the reference source code and the migrated code.
Research Questions:

RQ1: Does BLEU score accurately reflect the semantic similarity between the translated source code and the reference code in the ground truth?
RQ2: Is BLEU effective in comparing the translation quality of SMT-based code migration models?
RQ3: If BLEU is not effective in evaluating the translated results, what alternative metric can measure the semantic accuracy of migrated code?