Free article download: A spam filtering multi-objective optimization study covering parsimony maximization and three-way classification

Persian title
A multi-objective spam filter optimization study covering parsimony maximization and three-way classification
English title
A spam filtering multi-objective optimization study covering parsimony maximization and three-way classification
Pages (Persian article)
0
Pages (English article)
13
Year of publication
2016
Publisher
Elsevier
Format of the English article
PDF
Product code
E2171
Fields related to this article
Computer Engineering and Information Technology Engineering
Specializations related to this article
Information Security
Journal
Applied Soft Computing
University
School of Technology and Management, Computer Science and Communications Research Centre, Polytechnic Institute of Leiria, Portugal
Keywords
Spam filtering, multi-objective optimization, three-way classification, rule-based classification
Abstract

ABSTRACT


Classifier performance optimization in machine learning can be stated as a multi-objective optimization problem. In this context, recent works have shown the utility of simple evolutionary multi-objective algorithms (NSGA-II, SPEA2) for conveniently optimizing the global performance of different anti-spam filters. The present work extends existing contributions in the spam filtering domain by using three novel evolutionary multi-objective algorithms: two indicator-based (SMS-EMOA, CH-EMOA) and one decomposition-based (MOEA/D). The proposed approaches are used to optimize the performance of a heterogeneous ensemble of classifiers in two different but complementary scenarios: parsimony maximization and e-mail classification under a low confidence level. Experimental results using a publicly available standard corpus allowed us to draw conclusions regarding both the utility of rule-based classification filters and the appropriateness of a three-way classification system in the spam filtering domain.
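To make the multi-objective framing in the abstract concrete, the following minimal sketch shows Pareto dominance and the extraction of a non-dominated set, the basic notion on which algorithms such as NSGA-II, SPEA2, SMS-EMOA and MOEA/D rely. It is not the authors' implementation. The three objectives (false positive rate, false negative rate, number of active rules) and all candidate values are assumptions chosen purely for illustration.

```python
from typing import List, Tuple

# One candidate filter configuration, described by its objective values.
# Assumption for this sketch: every objective is to be minimized.
Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """Return True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates (input order preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates: (false positive rate, false negative rate, active rules).
candidates = [
    (0.02, 0.10, 120),
    (0.01, 0.15, 60),
    (0.02, 0.12, 150),  # dominated by the first candidate
    (0.05, 0.05, 200),
    (0.01, 0.15, 80),   # dominated by the second candidate
]

front = pareto_front(candidates)
for fp_rate, fn_rate, n_rules in front:
    print(f"FP={fp_rate:.2f}  FN={fn_rate:.2f}  rules={n_rules}")
```

The surviving candidates are trade-offs: none can be improved in one objective without getting worse in another, which is why the result of such an optimization is a set of filters rather than a single best one.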

Conclusion

4. Conclusions and future work


In this work, we have evaluated the utility of several multi-objective evolutionary algorithms for optimizing rule-based anti-spam filters from different but complementary perspectives. To this end, we presented two experimental case studies in which filter complexity and a three-way classification strategy were considered as additional objectives. The first scenario (parsimony maximization) revealed that the number of rules could be significantly reduced without affecting filter performance. Moreover, the experimental results for the three-way classification approach demonstrated the utility of defining a boundary region (where the classifier confidence is too low) to reduce the number of misclassification errors.

In this context, and from the experiments carried out, we would like to emphasize that of the 330 rules that match messages in the SpamAssassin corpus, only 5% to 20% are really needed to achieve an optimal classification. Moreover, given the particular nature of the spam filtering domain, a considerable number of the relevant rules are based on regular expressions. These rules are used to parse and check the e-mail structure, syntax and content, and they represent a major contribution to anti-spam filter customization. The design of this type of rule constitutes an important share of the effort made by system administrators to release novel and accurate anti-spam filters. Therefore, research aimed at the automatic generation of regular expressions from any given corpus is of high interest, and was initially addressed in the work of Basto-Fernandes et al. [56].

With regard to our three-way classification experiments, indicator-based algorithms were found to perform well in the multi-objective optimization of ROC curve performance. The best results for the VUS indicator were achieved by 3DCH-EMOA. Additionally, according to the SPREAD indicator results, this algorithm also achieves good performance, taking into account that this approach does not allow points to be included in the concave parts of the Pareto front. Finally, with the introduction of an extra ‘unclassified’ label in the filter (intended to inform the user of messages with a low confidence level), harmful misclassifications can be avoided to a considerable extent at a low cost in time for e-mail users.
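The boundary-region idea described above can be sketched as a two-threshold decision on the summed scores of the rules a message matches. This is an illustrative sketch only, not the filter used in the experiments. The rule names, scores and thresholds below are all hypothetical, and treating a zero-score rule as disabled is an assumption made here to illustrate how parsimony could be measured.

```python
from typing import Dict, Iterable


def message_score(matched_rules: Iterable[str], rule_scores: Dict[str, float]) -> float:
    """Sum the scores of all rules that matched a message (unknown rules count as 0)."""
    return sum(rule_scores.get(rule, 0.0) for rule in matched_rules)


def classify_three_way(score: float, lower: float, upper: float) -> str:
    """Return 'spam', 'ham', or 'unclassified' for scores inside the boundary region."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "spam"
    if score <= lower:
        return "ham"
    return "unclassified"  # confidence too low: leave the decision to the user


def active_rule_count(rule_scores: Dict[str, float]) -> int:
    """Count rules with a non-zero score (assumption: a zero score disables a rule)."""
    return sum(1 for s in rule_scores.values() if s != 0.0)


# Hypothetical rule set; names and scores are invented for this example.
rule_scores = {
    "HYP_SUSPICIOUS_LINK": 2.5,
    "HYP_ALL_CAPS_SUBJECT": 1.5,
    "HYP_FORGED_HEADER": 2.0,
    "HYP_KNOWN_SENDER": -3.0,
    "HYP_DISABLED_RULE": 0.0,
}
LOWER, UPPER = 1.0, 5.0  # hypothetical thresholds delimiting the boundary region

messages = {
    "msg1": ["HYP_SUSPICIOUS_LINK", "HYP_ALL_CAPS_SUBJECT", "HYP_FORGED_HEADER"],
    "msg2": ["HYP_SUSPICIOUS_LINK", "HYP_ALL_CAPS_SUBJECT"],
    "msg3": ["HYP_KNOWN_SENDER"],
}

labels = {
    name: classify_three_way(message_score(rules, rule_scores), LOWER, UPPER)
    for name, rules in messages.items()
}
print(labels)
print("active rules:", active_rule_count(rule_scores))
```

Widening the gap between the two thresholds trades fewer misclassifications for more messages that the user has to review manually, which is the cost in time mentioned above.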

