Abstract:
Authorship attribution is the process of identifying the author of a given text. From a machine learning perspective, it can be treated as a classification problem. To create the largest publicly available authorship attribution dataset, we extracted the works of 50 well-known Victorian-era authors. All of the extracted works are novels. To make the learning problem non-exhaustive, the training data contains 45 authors and the testing data contains all 50. The 5 authors missing from training account for 34% of the testing set. Each instance is a 1000-word piece of an author's text, and there are 93600 such pieces in total. To make the problem more challenging, the training and testing pieces are taken from different books. We applied 5 main feature extraction techniques to this data and compared the performance of the resulting features across different classifiers and deep learning structures. We also introduce the use of Word2Vec for authorship attribution, with two main approaches: author-based Word2Vec training, and treating each author's text pieces individually. Support vector machine classifiers of the nu-SVC type were observed to give the best success rates on the stacked set of useful features.
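The piece construction and the non-exhaustive split described above can be illustrated with a minimal Python sketch. It is not the authors' extraction code: the function names `split_into_pieces` and `unseen_author_share` are hypothetical, and dropping a trailing remainder shorter than 1000 words is an assumption the abstract does not confirm.

```python
def split_into_pieces(text, piece_len=1000):
    """Split text into consecutive pieces of exactly piece_len words.

    Assumption: a trailing remainder shorter than piece_len is dropped.
    """
    words = text.split()
    n_full = len(words) // piece_len
    return [
        " ".join(words[i * piece_len:(i + 1) * piece_len])
        for i in range(n_full)
    ]


def unseen_author_share(train_labels, test_labels):
    """Return the authors that appear only in the test labels, and the
    fraction of test instances that belong to them."""
    seen = set(train_labels)
    unseen = set(test_labels) - seen
    n_unseen = sum(1 for author in test_labels if author not in seen)
    return unseen, n_unseen / len(test_labels)


# Toy data only: 2500 placeholder words yield two full 1000-word pieces.
toy_text = " ".join(f"w{i}" for i in range(2500))
pieces = split_into_pieces(toy_text)
print(len(pieces))  # 2

# Toy labels: author 3 never appears in training.
unseen, share = unseen_author_share([1, 2, 1, 2], [1, 2, 3, 3])
print(unseen, share)  # {3} 0.5
```

Run on the real label columns, `unseen_author_share` should report 5 unseen authors and a share of about 0.34 if the figures in the abstract hold.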
Description:
The data was extracted through the GDELT Project (https://blog.gdeltproject.org/) using Google BigQuery. This dataset is publicly available for anyone to use under the terms provided by the dataset source: http://gdeltproject.org/about.html. The GDELT Project is an open platform for research and analysis of global society, and all datasets released by the GDELT Project are available for unlimited and unrestricted academic, commercial, or governmental use of any kind, without fee.