[Paper Summary] Code-to-Code Search Based on Deep Neural Network and Code Mutation

October 28, 2023

A paper I want to check in relation to code clones:

Code-to-Code Search Based on Deep Neural Network and Code Mutation (osaka-u.ac.jp)

Abstract of the paper in 2 lines:

The paper presents an approach for code-to-code search using a Deep Neural Network (DNN) model and code mutation to generate sufficient code fragments for each label.

The preliminary experiment demonstrates high precision and recall of the proposed approach.[1]

What are the contributions of this paper

The paper presents an approach for code-to-code search based on a Deep Neural Network (DNN) model and code mutation, which generates sufficient code fragments for each label.[1]

The approach consists of two steps: STEP L (Learning) and STEP S (Search), enabling the identification of similar code fragments corresponding to a query code fragment.[2]

A case study conducted on three open-source software systems, HBase 2.0, OpenSSL 0.9.1 to 1.1.0, and FreeBSD 11.1.0, demonstrates the effectiveness of the approach in terms of precision, recall, and F-measure.[3]

The preliminary experiment shows high precision and recall of the proposed approach, indicating its potential for accurate code-to-code search.[1]

Note: The provided sources do not explicitly mention the contributions of the paper. However, based on the information provided in the abstract and case study section, these are the contributions inferred from the paper.

Related Papers

Here are some related papers:

1. Code function taste detection method based on deep semantics

Shi Chongyang+2 others • 2019

2. A Search-Based Testing Framework for Deep Neural Networks of Source Code Embedding

Maryam Vahdat Pour+3 others • 2021, arXiv: Software Engineering

1 citation

3. Semantic code search using Code2Vec: A bag-of-paths model

Lakshmanan Arumugam • 2020

4 citations

4. Interactive code searching method and device based on structured embedding

Peng Min+4 others • 2020

5. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Yue Wang+3 others • 2021, Empirical Methods in Natural Language Processing

Practical implications of this paper

Practical Implications of the Paper "Code-to-Code Search Based on Deep Neural Network and Code Mutation":

The proposed approach enables accurate code-to-code search by utilizing a Deep Neural Network (DNN) model and code mutation techniques.

It allows software engineers to search for syntactically similar source code, which can be beneficial for tasks such as code reuse, bug fixing, and understanding unfamiliar codebases.

The approach has been evaluated on multiple open-source software systems, demonstrating its effectiveness in terms of precision, recall, and F-measure.

By generating a sufficient number of code fragments for each label through code mutation, the approach overcomes the challenge of requiring a large number of code fragments for DNN model training.

The use of DNNs for code-to-code search expands the application of deep learning techniques in software engineering, offering a new approach for code analysis and exploration.
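
The mutation idea above can be illustrated with a minimal sketch. This summary does not say which mutation operators the paper uses, so the two operators below (identifier renaming and integer-literal replacement), the keyword list, and all function names are assumptions chosen purely for illustration.

```python
import random
import re

IDENTIFIER = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")
INT_LITERAL = re.compile(r"\b\d+\b")
# Small, incomplete keyword set so that renaming leaves common keywords alone.
KEYWORDS = {"int", "for", "if", "else", "return", "while", "void"}


def rename_identifier(code: str, rng: random.Random) -> str:
    """Hypothetical operator: rename every occurrence of one identifier."""
    names = sorted({m.group() for m in IDENTIFIER.finditer(code)} - KEYWORDS)
    if not names:
        return code
    target = rng.choice(names)
    new_name = f"{target}_m{rng.randint(0, 999)}"
    return re.sub(rf"\b{re.escape(target)}\b", new_name, code)


def replace_literal(code: str, rng: random.Random) -> str:
    """Hypothetical operator: replace one integer literal with another value."""
    matches = list(INT_LITERAL.finditer(code))
    if not matches:
        return code
    m = rng.choice(matches)
    return code[:m.start()] + str(rng.randint(0, 99)) + code[m.end():]


def generate_mutants(code: str, count: int, seed: int = 0) -> list[str]:
    """Produce `count` mutated copies of one code fragment."""
    rng = random.Random(seed)
    operators = [rename_identifier, replace_literal]
    return [rng.choice(operators)(code, rng) for _ in range(count)]
```

Calling generate_mutants(fragment, 100) on a single fragment yields 100 variants that could share the fragment's label. Literal replacement can happen to pick the same value, so a real pipeline would probably deduplicate the mutants.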

Summarize introduction of this paper

Introduction of the Paper "Code-to-Code Search Based on Deep Neural Network and Code Mutation":

Developers often need to find code fragments with similar functionality, such as more efficient or reliable implementations or less vulnerable alternatives.

Previous approaches have been proposed for code-to-code search, including the use of Deep Neural Networks (DNNs) for code clone detection.

The key insight of this paper is to utilize DNNs for labeling code fragments based on their functionality, enabling code-to-code search.

The approach involves two steps: STEP L (Learning) and STEP S (Search). In STEP L, mutated source code fragments are generated, clustered, and labeled with unique labels. The feed-forward neural network model learns tuples of feature vectors and labels. In STEP S, the trained model calculates a label for a query code fragment and retrieves the original code blocks corresponding to that label.

The effectiveness of the approach is confirmed through a case study on three open-source software systems.

[1][2]
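
The two steps described above can be sketched roughly as follows. To keep the example short and dependency-free, a nearest-centroid classifier with cosine similarity stands in for the paper's feed-forward neural network, and whitespace tokenization stands in for its feature extraction. Both substitutions, along with the class and method names, are assumptions and not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def to_vector(code: str) -> Counter:
    """Token-count feature vector (whitespace tokenization is an assumption)."""
    return Counter(code.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CodeSearchIndex:
    def __init__(self):
        self.centroids = {}                  # label -> summed vector of its fragments
        self.originals = defaultdict(list)   # label -> original code blocks

    def learn(self, clusters):
        """STEP L: clusters maps a label to (original_blocks, mutated_fragments)."""
        for label, (originals, mutants) in clusters.items():
            self.originals[label].extend(originals)
            centroid = Counter()
            for fragment in list(originals) + list(mutants):
                centroid.update(to_vector(fragment))
            # Cosine similarity ignores scale, so the sum works as a centroid.
            self.centroids[label] = centroid

    def search(self, query: str):
        """STEP S: predict a label for the query and return that label's originals."""
        q = to_vector(query)
        label = max(self.centroids, key=lambda l: cosine(q, self.centroids[l]))
        return label, self.originals[label]
```

After index.learn(clusters), calling index.search(query) returns the predicted label together with the original (unmutated) code blocks stored under it, mirroring the retrieval described for STEP S.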

Literature survey of this paper

Literature survey of the paper "Code-to-Code Search Based on Deep Neural Network and Code Mutation":

Deep Neural Networks (DNNs) have been commonly used for image labeling tasks, such as object detection.

The paper proposes the use of DNNs for code-to-code search, which is a novel application of deep learning in software engineering.

The approach presented in the paper utilizes code mutation techniques to generate a sufficient number of code fragments for each label in the learning process of DNNs[1].

The paper mentions the use of other deep networks, such as RNN, for code-to-code search, indicating the potential for further exploration and improvement of the approach[2].

The approach has been evaluated on multiple open-source software systems, including HBase, OpenSSL, and FreeBSD, demonstrating its performance in terms of precision, recall, and F-measure[3][4].

The paper also discusses the generation of feature vectors using techniques like Bag of Words (BoW) and Doc2Vec, highlighting the flexibility of the approach in handling different types of code representations[5].

Note: The provided sources do not explicitly mention a comprehensive literature survey. However, the information extracted from the sources provides an overview of the existing literature and the novelty of the proposed approach.
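
As a rough illustration of the Bag of Words (BoW) feature vectors mentioned above, the sketch below counts tokens against a fixed vocabulary. The tokenizer regex and the function names are assumptions; the paper's actual tokenization and vocabulary construction are not described in this summary.

```python
import re
from collections import Counter

# Simplified tokenizer: identifiers, integer literals, or single symbols.
# Multi-character operators such as "++" are split into single characters.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> list[str]:
    return TOKEN.findall(code)


def build_vocabulary(fragments: list[str]) -> dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: dict[str, int] = {}
    for fragment in fragments:
        for token in tokenize(fragment):
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(code: str, vocab: dict[str, int]) -> list[int]:
    """Count vocabulary tokens in `code`; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokenize(code)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector
```

Because the vocabulary is fixed, every fragment maps to a vector of the same length, which is the property a fixed-size network input layer needs.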

Methods used in this paper

Deep Neural Networks (DNNs) are utilized for code-to-code search, where code fragments are labeled based on their functionality.

Code mutation techniques are employed to generate a sufficient number of code fragments for each label in the learning process of DNNs.

The approach is evaluated using precision, recall, and F-measure as evaluation metrics [2].

A case study is conducted on three open-source software systems, including HBase, OpenSSL, and FreeBSD, to demonstrate the performance of the approach [3] [4].

Note: The provided sources do not explicitly mention the methods used in the paper. However, the information extracted from the sources provides an understanding of the approach and its evaluation.
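
For reference, the three evaluation metrics named above can be computed from a set of retrieved code blocks and a set of relevant (truly similar) code blocks. This is the standard definition of precision, recall, and F-measure; the function name and the block identifiers in the usage note are hypothetical.

```python
def precision_recall_f(retrieved: set, relevant: set) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one search result."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, if a search returns blocks {b1, b2, b3, b4} and the truly similar blocks are {b1, b2, b3}, recall is 1.0 while precision is 0.75, which shows how a recall of 1.000 can coexist with some incorrect results.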

What data has been used in this paper

Data used in this paper:

The paper utilizes a dataset created for training and evaluation purposes, following the steps outlined in STEP E1. Similar code block sets are extracted from the target open-source software (OSS), and sets with 100 or more code blocks are selected to ensure an adequate number of similar code blocks for training and evaluation [1].

Multiple versions of open-source software systems, including HBase, OpenSSL, and FreeBSD, are used in a case study to evaluate the performance of the approach [2].

Positive data, which are vectors generated from code blocks that meet specific conditions, are used in the experiments for evaluation of search performance.

The experiment involves the use of code fragments, feature vectors, and labels for training and testing the deep neural network model [3].

Note: The provided sources do not explicitly mention the specific details of the data used in the paper. However, the information extracted from the sources provides an understanding of the dataset creation, case study, and evaluation process.
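
The selection rule described above (keeping only similar code block sets with 100 or more code blocks) amounts to a simple group-and-filter step. The sketch below assumes each code block already carries an identifier for the similar-code set it belongs to; how those sets are extracted is not covered in this summary, and the function and parameter names are hypothetical.

```python
from collections import defaultdict

MIN_SET_SIZE = 100  # threshold stated in the summary above


def select_training_sets(blocks, min_size=MIN_SET_SIZE):
    """Group (set_id, code_block) pairs and keep sets with at least min_size blocks."""
    grouped = defaultdict(list)
    for set_id, code in blocks:
        grouped[set_id].append(code)
    return {sid: codes for sid, codes in grouped.items() if len(codes) >= min_size}
```

Each surviving set can then be assigned a unique label and used for training and evaluation.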

Results of the paper

The preliminary experiment shows high precision and recall of the proposed approach for code-to-code search based on a deep neural network model and code mutation [1].

Table II presents the precision, recall, and F-score results of the experiments. Recall in the experiments using HBase and OpenSSL is 1.000, indicating that the model retrieved all of the syntactically similar code blocks in those experiments.

However, Figure 10 shows an example of an incorrect search result, where the trained model reports dissimilar code blocks as similar. The differences between the code blocks shown in Figure 10 are a possible cause of such cases [2].

Precision, recall, and F-measure are used as evaluation criteria for search performance [3] [4].

A case study is conducted on the effect of mutation, investigating the relationship between the number of positive data for training and the probability of correct label output by the trained model [5].

The performance of the approach is evaluated using precision, recall, and F-measure on benchmark datasets from HBase, OpenSSL, and FreeBSD [6].

Note: The provided sources do not explicitly mention the overall results of the paper. However, the information extracted from the sources provides insights into the performance evaluation and case study conducted in the paper.

Ashitomi Self-Study Notes