arXiv API in R

by Adam M. Nguyen and Michael T. Moen

Hosted and maintained by Cornell University, arXiv is a free, open-access distribution service that, at the time of writing, contains nearly 2.5 million scholarly articles in fields including physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. This tutorial introduces the API through a few examples. For larger bulk downloads of arXiv data, we recommend Kaggle’s arXiv Dataset, which is updated monthly with the full arXiv data set and metadata.

Please see the following resources for more information on API usage:

Acknowledgment: Thank you to arXiv for use of its open access interoperability.

These recipe examples were tested on October 20, 2025.

Setup

The following packages need to be installed in your environment to run the code examples in this tutorial. They can be installed with install.packages().

library(aRxiv)
library(ggplot2)
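If you are unsure whether both packages are already installed, a small check like the one below may help. This is only a minimal sketch: `missing_packages` is a hypothetical helper name (it is not part of aRxiv), and it relies solely on base R's `requireNamespace()`.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("aRxiv", "ggplot2")
to_install <- missing_packages(needed)

# Print a suggested install command rather than installing automatically
if (length(to_install) > 0) {
  message("Run: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
} else {
  message("All required packages are installed.")
}
```

Printing the command instead of running it keeps the decision to install in your hands.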

Retrieving Categories

Before we get started, note that the aRxiv package provides a useful dataset, arxiv_cats. It lists each arXiv subject classification’s abbreviation along with its field and description. Categories are especially important when forming queries to the API, so we mention them here first.

# Show the first 10 categories in the arxiv_cats dataset
head(arxiv_cats[c("category", "field", "short_description")], n = 10)
##    category            field                               short_description
## 1     cs.AI Computer Science                         Artificial Intelligence
## 2     cs.AR Computer Science                           Hardware Architecture
## 3     cs.CC Computer Science                        Computational Complexity
## 4     cs.CE Computer Science Computational Engineering, Finance, and Science
## 5     cs.CG Computer Science                          Computational Geometry
## 6     cs.CL Computer Science                        Computation and Language
## 7     cs.CR Computer Science                       Cryptography and Security
## 8     cs.CV Computer Science         Computer Vision and Pattern Recognition
## 9     cs.CY Computer Science                           Computers and Society
## 10    cs.DB Computer Science                                       Databases
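Because the categories come back as an ordinary data frame, they can be searched by keyword to find the abbreviation needed for a query. The sketch below is illustrative only: `find_categories` is a hypothetical helper (not part of aRxiv), and `toy_cats` is a small stand-in built from three rows of the output above so that the example runs without the package.

```r
# Hypothetical helper: case-insensitive keyword search over category descriptions
find_categories <- function(cats, keyword) {
  hits <- grepl(keyword, cats$short_description, ignore.case = TRUE)
  cats[hits, c("category", "short_description")]
}

# Toy data frame copied from three rows of the output above;
# with aRxiv loaded, pass `arxiv_cats` instead.
toy_cats <- data.frame(
  category = c("cs.AI", "cs.CG", "cs.DB"),
  field = "Computer Science",
  short_description = c("Artificial Intelligence",
                        "Computational Geometry",
                        "Databases")
)

find_categories(toy_cats, "geometry")
```

The same call with `arxiv_cats` in place of `toy_cats` should search the full category list, assuming the column names shown above.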

2. Retrieving Number of Query Results

The aRxiv package can also return the number of papers that match a query. For example, we can see how many papers our previous “Hydrodynamics” query returns.

# How many paper titles contain "hydrodynamics"?
arxiv_count('ti:"hydrodynamics"')
## [1] 7272

We can also see how many HEP-th papers there are.

# How many papers fall under the HEP-th category?
arxiv_count("cat:HEP-th")
## [1] 177084
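The two queries above use the field prefixes `ti:` (title) and `cat:` (category), which the arXiv API lets you combine with Boolean operators such as `AND`. As a rough sketch of how such query strings can be assembled, the hypothetical helper `build_query` below (not part of aRxiv) joins an optional title phrase and an optional category.

```r
# Hypothetical helper: combine optional title and category filters with AND
build_query <- function(title = NULL, category = NULL) {
  parts <- c()
  if (!is.null(title)) {
    # Quote the title so it is matched as a phrase
    parts <- c(parts, paste0('ti:"', title, '"'))
  }
  if (!is.null(category)) {
    parts <- c(parts, paste0("cat:", category))
  }
  paste(parts, collapse = " AND ")
}

query <- build_query(title = "hydrodynamics", category = "hep-th")
print(query)

# With aRxiv loaded, the string could then be passed on, e.g. arxiv_count(query)
```

Building the string separately makes it easy to inspect the query before sending it to the API.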

Finally, we can see how many HEP-th papers were submitted in each year.

# Years we are interested in
years <- 1991:2023

# Query the number of HEP-th papers submitted in each year
counts <- sapply(years, function(year) {
  arxiv_count(paste0('cat:HEP-th AND submittedDate:[', year, ' TO ', year + 1, ']'))
})

# Combine the years and counts into a data frame
counts_df <- data.frame(year = years, count = counts)

# Simple base R plot of the data
plot(counts_df,
     main = 'Theoretical High Energy Physics Papers Published per Year',
     xlab = 'Year',
     ylab = 'Number of Papers')
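The loop above passes bare years as the bounds of the `submittedDate` range. The arXiv API documents full timestamps of the form YYYYMMDDHHMM (in GMT) for this field, so an alternative is to spell out each year's bounds explicitly, which avoids any ambiguity about whether consecutive ranges share a boundary. The following is a minimal sketch under that assumption; `year_queries` is a hypothetical helper name, not an aRxiv function.

```r
# Hypothetical helper: one query string per year with explicit
# YYYYMMDDHHMM bounds (January 1 00:00 through December 31 23:59)
year_queries <- function(category, years) {
  paste0("cat:", category,
         " AND submittedDate:[", years, "01010000 TO ", years, "12312359]")
}

queries <- year_queries("hep-th", 1991:1993)
print(queries)

# With aRxiv loaded, the counts could then be collected with, e.g.:
# counts <- sapply(queries, arxiv_count, USE.NAMES = FALSE)
```

Since `paste0` is vectorized, a single call produces the query string for every year at once.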

3. Proportion of Preprints in Hydrodynamics Papers

arXiv’s repository contains both electronic preprints and links to postprints (e.g., the version of record DOI). We will explore the proportion of preprints in the previous “Hydrodynamics” query. This is possible because the doi column returned by the query is empty for articles that do not have a DOI, i.e., preprints.

# Count the number of preprints by looking for empty entries in the 'doi' column
# (hydrodynamic_search is the data frame from the previous "Hydrodynamics" query)
hydrodynamic_preprint_count <- sum(hydrodynamic_search$doi == "")

# Calculate a percentage of preprints
percentage_preprints <- (hydrodynamic_preprint_count / nrow(hydrodynamic_search)) * 100

paste0('The percentage of preprints is ',round(percentage_preprints, digits = 2),'%.')
## [1] "The percentage of preprints is 23.93%."
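The calculation above assumes that a missing DOI is always stored as an empty string. If some entries could instead be `NA` or contain only whitespace, a slightly more defensive version may be useful. The sketch below is illustrative: `percent_without_doi` is a hypothetical helper, and `toy_doi` is made-up placeholder data, not results from arXiv.

```r
# Hypothetical helper: percentage of entries whose DOI is missing (NA or "")
percent_without_doi <- function(doi) {
  no_doi <- is.na(doi) | trimws(doi) == ""
  100 * mean(no_doi)
}

# Toy DOI column for illustration only; the DOI string is a placeholder
toy_doi <- c("", "10.1000/example.1", NA, "")

percent_without_doi(toy_doi)
```

Applied to the search results, this would be called as `percent_without_doi(hydrodynamic_search$doi)`, assuming the data frame from the previous query is available.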