Predicting citation count of Bioinformatics papers
within four years of publication
Abstract

    Nowadays, publishers of scientific journals face the tough task of selecting, from a growing pool of submissions, the high-quality articles that will attract as many readers as possible. This difficulty stems from the growth of scientific output and literature. A tool capable of predicting the citation count of an article within the first few years after publication would pave the way for new assessment systems.

    This paper presents a new approach based on building several prediction models for the Bioinformatics journal. These models, referred to as global models, predict the citation count of an article within four years after publication. To build them, tokens found in the abstracts of Bioinformatics papers have been used as predictive features, along with other features such as the journal sections and two-week post-publication periods. To improve on the accuracy of the global models, specific models have been built for each Bioinformatics journal section (Data and Text Mining, Databases and Ontologies, Gene Expression, Genetics and Population Analysis, Genome Analysis, Phylogenetics, Sequence Analysis, Structural Bioinformatics and Systems Biology). In these section-specific models, the average success rate for predictions using the naive Bayes and logistic regression supervised classification methods was 89.4% and 91.5%, respectively, across the nine sections and for the four-year time horizon.




1. Introduction

   Publishers nowadays face the problem of deciding which of the many papers they receive are of high enough quality to be published in their journals. The current method used for article assessment is peer review. This process involves two or more reviewers reading and discussing each paper to determine the validity of its ideas and results, and its potential impact on the world of science.

   Although peer review, when used properly, is assumed to be the most reliable system, it is slow, expensive and unwieldy (Mulligan (2005); Scarpa (2006); Cobo et al. (2007)). Other authors contest this appraisal (Horrobin (2001); Hanks (2005)). This difference of opinion among authors has led to the development of several quantitative metrics associated with scientific production. One such metric is citation count, that is, the number of citations received by a paper in a given period of time. Although citations are primarily a measure of visibility, they can also be considered an indirect measure of article quality. The aim of this measure is to mirror the impact and quality of papers (Bornmann and Daniel (2008)).

   Our work is based on the construction of predictive models to forecast the citation count of a paper within four years after publication. For this study we focus on papers published in Bioinformatics from January 1, 2005 to December 31, 2007. The supervised classification methods used in this paper are Bayesian networks (naive Bayes and K2), logistic regression, decision trees and the k-nearest neighbor algorithm. These methods will be compared with each other.
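   As a rough illustration of how tokens from abstracts can feed a supervised classifier, the following minimal sketch trains a multinomial naive Bayes model with Laplace smoothing in plain Python. It is not the authors' implementation: the toy abstracts, the two class labels ("few" and "many" citations), the tokenizer and the function names are all assumptions made for the example, and the other features mentioned above (journal section, post-publication period) are left out.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(abstracts, labels):
    """Count documents per class and token occurrences per class."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for text, label in zip(abstracts, labels):
        token_counts[label].update(tokenize(text))
    vocabulary = {tok for counts in token_counts.values() for tok in counts}
    return {
        "class_counts": class_counts,
        "token_counts": token_counts,
        "vocabulary": vocabulary,
    }


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    n_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    # Tokens never seen during training carry no information and are skipped.
    tokens = [tok for tok in tokenize(text) if tok in model["vocabulary"]]
    best_label, best_score = None, -math.inf
    for label, n_label in model["class_counts"].items():
        counts = model["token_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / n_docs)  # log prior
        for tok in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((counts[tok] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: invented abstracts with invented citation classes.
TRAIN_ABSTRACTS = [
    "a new alignment algorithm for genome sequence analysis",
    "fast sequence alignment software for large genome data",
    "a database of curated ontology annotations",
    "an updated ontology database with curated terms",
]
TRAIN_LABELS = ["many", "many", "few", "few"]

model = train_naive_bayes(TRAIN_ABSTRACTS, TRAIN_LABELS)
print(predict(model, "genome alignment algorithm"))
print(predict(model, "curated ontology database"))
```

   The same token features could be passed to any of the other methods listed above, such as logistic regression or k-nearest neighbor, so that the resulting classifiers can be compared on held-out papers.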
Alfonso Ibáñez, Pedro Larrañaga and Concha Bielza




Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid. 28660 Madrid, Spain
aibanez@fi.upm.es, pedro.larranaga@fi.upm.es, mcbielza@fi.upm.es


