Abstract
This paper focuses on the use of machine learning techniques to analyze computer programs in order to infer the gender of their authors. Few existing studies address the relationship between linguistics and programming; however, in many areas where language is analyzed, it is possible to mine important information about the users of that language from a set of attributes or from their coding style. In this work we use open source implementations of three machine learning algorithms: nearest neighbor (K*), decision tree (J48), and Bayes classifier (Naïve Bayes). These algorithms were applied to C++ programs that were associated with sociolinguistic information about the program authors. Our goal was to classify the programs according to the gender of the author. Our initial results show a precision of 72.3%, a recall of 72%, and an F-measure of 71.9%, which demonstrates that the gender of the authors of C++ programs can be predicted.
I. INTRODUCTION AND MOTIVATION
In the field of sociolinguistics, it is known that individual differences in the use of a language within a society can affect or reflect social factors. Linguistic variables correlate with social variables such as age, socio-economic status, gender, ethnicity, and region to create sociolinguistic variation [1]. However, very few researchers have applied this analysis to the field of computer programming. We are thus interested in answering the following question: do social factors impact the development of C++ programs? To begin to answer this question, we report here on our efforts to categorize C++ programs based on the gender of the programmers.
VII. CONCLUSION AND FUTURE WORK
In this work, we categorized computer programs on the basis of the gender of the programmers. We used a dataset composed of C++ programs written by male and female programmers. We developed three classification models: nearest neighbor (K*), decision tree (J48), and Bayes classifier (Naïve Bayes). We concluded that, for a limited dataset of C++ programs, it is possible to use machine learning techniques to differentiate between male and female programmers. We achieved a precision of 72.3%, a recall of 72%, and an F-measure of 71.9%, which indicates that social factors such as gender are reflected in the use of the C++ programming language.
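As an illustration of how the three reported scores relate to one another, the following is a minimal sketch, not the evaluation procedure used in this work. It assumes the figures are per-class precision, recall, and F-measure averaged with class-size weights, which would explain why the reported F-measure is not simply the harmonic mean of the reported precision and recall. The function name `weighted_metrics` and the confusion-matrix counts are hypothetical and are not taken from our dataset.

```python
from typing import Dict, List


def weighted_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """Class-weighted precision, recall, and F-measure.

    confusion[i][j] is the number of programs whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    precision = recall = f_measure = 0.0
    for c in range(n_classes):
        true_pos = confusion[c][c]
        actual = sum(confusion[c])                               # row sum
        predicted = sum(confusion[r][c] for r in range(n_classes))  # column sum
        p = true_pos / predicted if predicted else 0.0
        r = true_pos / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = actual / total  # share of programs belonging to class c
        precision += weight * p
        recall += weight * r
        f_measure += weight * f
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Hypothetical counts for illustration only (rows: true class, columns: predicted).
example = [
    [40, 10],  # class 0: 40 classified correctly, 10 misclassified
    [20, 30],  # class 1: 30 classified correctly, 20 misclassified
]
scores = weighted_metrics(example)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With weighting of this kind, the averaged F-measure can fall below both the averaged precision and the averaged recall, as it does in the figures reported above.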