Cost-Sensitive Classifier for Spam Detection on News Media Twitter Accounts (revised April 2017)

Abstract: Social media are increasingly being used as sources in mainstream news coverage. However, because news updates so rapidly, it is very easy to fall into the trap of accepting everything as true. Spam content usually refers to information that goes viral and skews users' views on subject...

Bibliographic Details
Main authors: Tur, Georvic; Homsi, Masun Nabhan
Format: Conference object
Language: English
Published: 2017
Subjects:
Online access: http://sedici.unlp.edu.ar/handle/10915/63208
http://www.clei2017-46jaiio.sadio.org.ar/sites/default/files/Mem/SLMDI/SLMDI-07.pdf
Contributed by:
id I19-R120-10915-63208
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
spam classification
twitter
topic discovering
cost-sensitive classifier
random forest
description Abstract: Social media are increasingly being used as sources in mainstream news coverage. However, because news updates so rapidly, it is very easy to fall into the trap of accepting everything as true. Spam content usually refers to information that goes viral and skews users' views on subjects. Despite recent advances in spam analysis methods, extracting accurate and useful information from tweets remains a challenging task. This paper introduces a new approach for classifying spam and non-spam tweets using a Cost-Sensitive Classifier that includes Random Forest. The approach consists of three phases: preprocessing, classification and evaluation. In the preprocessing phase, tweets were first annotated manually and then four different sets of features were extracted from them. In the classification phase, four machine learning algorithms were first cross-validated to determine the best base classifier for spam detection. Then, the class imbalance problem was addressed by resampling and by incorporating arbitrary misclassification costs into the learning process. In the evaluation phase, the trained algorithm was tested with unseen tweets. Experimental results showed that the proposed approach helped mitigate overfitting and reduced classification error, achieving an overall accuracy of 89.14% in training and 76.82% in testing.
format Conference object
author Tur, Georvic
Homsi, Masun Nabhan
title Cost-Sensitive Classifier for Spam Detection on News Media Twitter Accounts (revised April 2017)
publishDate 2017
url http://sedici.unlp.edu.ar/handle/10915/63208
http://www.clei2017-46jaiio.sadio.org.ar/sites/default/files/Mem/SLMDI/SLMDI-07.pdf
work_keys_str_mv AT turgeorvic costsensitiveclassifierforspamdetectiononnewsmediatwitteraccountsrevisedapril2017
AT homsimasunnabhan costsensitiveclassifierforspamdetectiononnewsmediatwitteraccountsrevisedapril2017
bdutipo_str Repositories
_version_ 1764820480564396032
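
The description above mentions incorporating misclassification costs into the learning process. One common way to do this, shown here only as a minimal sketch and not as the authors' implementation, is to pick the label with the lowest expected cost given the class probabilities from a base classifier (Random Forest in the paper). The cost values, the label encoding, and the function names below are hypothetical; the record does not state the costs the authors used.

```python
from typing import List, Sequence

# Hypothetical cost matrix, indexed as COSTS[actual][predicted].
# Labels (assumed encoding): 0 = non-spam, 1 = spam.
# The values are illustrative assumptions, not taken from the paper.
COSTS: List[List[float]] = [
    [0.0, 1.0],  # actual non-spam: flagging it as spam costs 1
    [5.0, 0.0],  # actual spam: letting it through costs 5
]


def expected_costs(probs: Sequence[float],
                   costs: Sequence[Sequence[float]]) -> List[float]:
    """Return the expected cost of predicting each label.

    probs[a] is the base classifier's estimated probability that the
    true label is a.
    """
    n = len(costs)
    return [sum(probs[a] * costs[a][p] for a in range(n)) for p in range(n)]


def min_cost_label(probs: Sequence[float],
                   costs: Sequence[Sequence[float]] = COSTS) -> int:
    """Return the label whose expected misclassification cost is lowest."""
    ec = expected_costs(probs, costs)
    return min(range(len(ec)), key=ec.__getitem__)
```

With these assumed costs, a tweet is labeled spam whenever its estimated spam probability exceeds 1/6, so the decision leans toward the class that is more expensive to miss even when it is not the most probable one.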