Last week, we had a new journal paper published. It presents a methodology to improve the accuracy of opinion mining and content classification of textual data retrieved from the web (see citation info below).
The central idea is to separate content classification from opinion mining into different levels. The proposed methodology has been tested with blog comments. In the first level, we perform the content classification with a supervised classifier. Here, we compared k-NN and SVM, with much better results for SVM, as implemented in the e1071 library in R. This is followed by the detection of different opinion trends (again, comparing k-NN with SVM, with the same favourable result for the latter). Both stages are based on supervised learning, training the classifiers with messages already labeled by human reviewers.
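As a rough illustration of these two supervised levels (not the code used in the paper, which relied on R), here is a minimal Python sketch. It uses a tiny k-NN over bag-of-words vectors only to stay dependency-free, even though SVM performed better in our experiments. The training messages, the labels "product" and "shipping", and the function names are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a deliberately crude feature extractor."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=1):
    """Majority label among the k training messages most similar to text."""
    query = bag_of_words(text)
    ranked = sorted(train, key=lambda pair: cosine(bag_of_words(pair[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical human-labeled training data, one set per level.
CONTENT_TRAIN = [
    ("the new phone battery lasts long", "product"),
    ("phone screen cracked after a week", "product"),
    ("delivery arrived two days late", "shipping"),
    ("the courier lost my delivery", "shipping"),
]
OPINION_TRAIN = [
    ("battery lasts long and works great", 1),
    ("great fast delivery", 1),
    ("screen cracked terrible quality", -1),
    ("courier lost my parcel terrible", -1),
    ("delivery arrived on tuesday", 0),
]


def two_level_classify(text, k=1):
    """Level 1: content category. Level 2: opinion trend (+1, 0 or -1)."""
    topic = knn_predict(CONTENT_TRAIN, text, k)
    opinion = knn_predict(OPINION_TRAIN, text, k)
    return topic, opinion


print(two_level_classify("my delivery was lost by the courier"))
```

Each level has its own training set and its own classifier call, so the features, the algorithm and its parameters can be tuned separately for each level.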
Finally, once we have the blog messages assigned to a certain opinion trend (+1 for positive, 0 for neutral or -1 for negative messages), we can apply an unsupervised learning algorithm (in our case, k-Means) to detect relevant topics characterizing the messages in each opinion group.
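A minimal sketch of this last level, assuming each message has already been turned into a numeric feature vector and assigned an opinion label. The two-dimensional vectors, the function names and the deterministic initialization (the first k points) are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def kmeans(points, k, iterations=10):
    """Plain k-means; returns the cluster index assigned to each point."""
    # Assumption: deterministic start from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def topics_per_opinion(messages, k=2):
    """Group messages by opinion label, then cluster each group on its own."""
    groups = defaultdict(list)
    for vector, opinion in messages:
        groups[opinion].append(vector)
    return {opinion: kmeans(vectors, min(k, len(vectors)))
            for opinion, vectors in groups.items()}


# Hypothetical (feature vector, opinion) pairs.
MESSAGES = [
    ((0.9, 0.1), 1), ((0.1, 0.8), 1), ((1.0, 0.0), 1), ((0.0, 1.0), 1),
    ((0.5, 0.5), 0),
    ((0.2, 0.9), -1), ((0.8, 0.2), -1),
]

print(topics_per_opinion(MESSAGES))
```

Clustering runs once per opinion group, so the topics found for positive messages are independent of those found for negative or neutral ones.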
In this way, by separating the different mining tasks into consecutive levels instead of trying to mine opinions and classify content at the same time, we can fine-tune the algorithms for each level and obtain more precise results.
Everyone with access to the Springer network (most academic and research centers, I think) can download the online digital paper. Ideally, this article would have been published as open access, but the rates from publishers are still very high in that case, and budgets quite limited…
C. Alfaro, J. Cano-Montero, J. Gómez, J. Moguerza, and F. Ortega, “A multi-stage method for content classification and opinion mining on weblog comments,” pp. 1-17, 2013. [Online]. Available: http://dx.doi.org/10.1007/s10479-013-1449-6