Open Science Framework (OSF) API in Python#
by Avery Fernandez and Michael T. Moen
The OSF API allows users to fetch metadata and files from the OSF platform. This cookbook will guide you through the setup and usage of the API, including fetching metadata for preprints and downloading PDFs.
Please see the following resources for more information on API usage:
Documentation
Terms of Use
NOTE: The OSF Preprints API limits requests to a maximum of 100 per hour without an access token, but 10,000 requests per day with an access token. See the Error Codes section of the documentation for more info.
These recipe examples were tested on December 3, 2024.
Setup#
Import Libraries#
The following external libraries need to be installed into your environment to run the code examples in this tutorial: requests, python-dotenv, and pandas.
We import the libraries used in this tutorial below:
import os
import requests
from dotenv import load_dotenv
import pandas as pd
from time import sleep
Import Access Token#
Authentication is not required to access the OSF API, but using an access token will increase your rate limit. You can sign up for an access token here.
We keep our API key in a .env file and use the dotenv library to access it. If you would like to use this method, create a file named .env in the same directory as this notebook and add the following line to it:
OSF_API_TOKEN="add-your-api-token-here"
load_dotenv()
try:
    API_TOKEN = os.environ["OSF_API_TOKEN"]
except KeyError:
    print("API key not found. Please set 'OSF_API_TOKEN' in your .env file.")
The OSF API requires the API token to be passed as a header:
HEADERS = {'Authorization': f'Bearer {API_TOKEN}'}
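Because authentication is optional, a script can also fall back to unauthenticated requests when no token is set. With the snippet above, a missing variable prints a message but leaves API_TOKEN undefined, so building HEADERS then raises a NameError. The following is a minimal sketch of one alternative; `build_headers` is a hypothetical helper name, not part of the OSF API or of this tutorial's original code.

```python
import os


def build_headers(token=None):
    """Return request headers, adding Authorization only when a token exists."""
    if token:
        return {"Authorization": f"Bearer {token}"}
    # No token: send no Authorization header and accept the lower rate limit
    return {}


# os.environ.get returns None instead of raising KeyError when the variable is unset
headers = build_headers(os.environ.get("OSF_API_TOKEN"))
```

The resulting dictionary can be passed to `requests.get(..., headers=headers)` in the same way as HEADERS.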
1. Fetching CC-BY 4.0 License Info#
Using the licenses endpoint, we can find data relating to various licenses. In this example, we limit our search to CC-BY 4.0 licenses.
url = 'https://api.osf.io/v2/licenses?filter[name]=cc-by&filter[name]=4.0'
response = requests.get(url, headers=HEADERS)
data = response.json()
for license in data['data']:
    print(license['attributes']['name'])
    print(license['attributes']['url'], '\n')
CC-By Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
CC-BY Attribution-No Derivatives 4.0 International
https://creativecommons.org/licenses/by-nd/4.0/legalcode
CC-BY Attribution-NonCommercial 4.0 International
https://creativecommons.org/licenses/by-nc/4.0/legalcode
CC-BY Attribution-NonCommercial-ShareAlike 4.0 International
https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
From the data returned, we can also retrieve the full-text of the licenses.
# Output limited to the first 264 characters for demonstration purposes
print(data['data'][0]['attributes']['text'][:264])
Creative Commons Attribution 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License ("Public License").
For the next example, we will create the ccby4_ids dictionary so that we can check the license of preprints when obtaining metadata.
ccby4_ids = {}
for license in data['data']:
    ccby4_ids[license['id']] = license['attributes']['name']
ccby4_ids
{'563c1cf88c5e4a3877f9e96a': 'CC-By Attribution 4.0 International',
'60bf983b58510b0009a5a9a4': 'CC-BY Attribution-No Derivatives 4.0 International',
'60bf992258510b0009a5a9a6': 'CC-BY Attribution-NonCommercial 4.0 International',
'60bf99e058510b0009a5a9a9': 'CC-BY Attribution-NonCommercial-ShareAlike 4.0 International'}
2. Fetching Preprint Metadata and PDFs#
In this use case, we will fetch the metadata for preprints that fall under a specified subject and are licensed under CC-BY 4.0 using the preprints endpoint. The metadata includes titles, publication dates, DOIs, authors, and PDF URLs.
Function to Fetch Preprints Metadata#
This function retrieves the metadata of CC-BY 4.0 preprints for a given subject, using the ccby4_ids obtained in the previous example to determine whether a preprint is CC-BY 4.0. For the sake of demonstration, only the first 100 preprints returned by the API are examined in this example.
# Function for fetching the metadata of preprints of a subject,
# keeping only CC-BY 4.0 preprints
def fetch_preprints_metadata(subject: str, limit: int = 1):
    base_url = 'https://api.osf.io/v2/preprints'
    params = {
        'filter[subjects]': subject,
        'page[size]': 100
    }
    preprints = []
    url = base_url
    iteration = 0
    while url and iteration < limit:
        iteration += 1
        if url == base_url:
            response = requests.get(url, params=params, headers=HEADERS)
        else:
            response = requests.get(url, headers=HEADERS)
        sleep(1)
        data = response.json()
        # Check if the preprint licenses are CC-BY 4.0
        for preprint in data['data']:
            # Check if the 'relationships' key is in the 'preprints' dictionary,
            # and then if the 'license' key is inside that dictionary
            if not preprint.get('relationships', {}).get('license'):
                continue
            if preprint['relationships']['license']['data']['id'] in ccby4_ids:
                preprints.append(preprint)
        url = data['links'].get('next')
    return preprints
# Retrieve the metadata for the CC-BY 4.0 preprints in the first 100 results
ccby4_metadata = fetch_preprints_metadata(subject='Education', limit=1)
# Print number of CC-BY 4.0 preprints found
len(ccby4_metadata)
67
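The paging logic inside fetch_preprints_metadata (request a page, then follow the 'next' link until it is missing or the page limit is reached) can also be written as a standalone generator. This is a minimal sketch, assuming each response body carries a links dictionary with a 'next' entry as used above; `iter_pages` and `fetch_json` are hypothetical names introduced for illustration.

```python
def iter_pages(fetch_json, first_url, max_pages=1):
    """Yield up to max_pages JSON pages, following links['next'] each time.

    `fetch_json` is any callable that takes a URL and returns the parsed
    JSON body as a dict.
    """
    url = first_url
    pages_seen = 0
    while url and pages_seen < max_pages:
        page = fetch_json(url)
        pages_seen += 1
        yield page
        # 'next' is None (or absent) on the last page, which ends the loop
        url = page.get("links", {}).get("next")
```

With requests, `fetch_json` could be something like `lambda u: requests.get(u, headers=HEADERS).json()`. Keeping the HTTP call outside the generator makes the paging logic easy to check without network access.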
Function to Get Contributors#
This function will be used by the process_preprints function below to find the contributors from the preprint metadata.
def get_contributors(contributors_url):
    if contributors_url is None:
        return []
    response = requests.get(contributors_url, headers=HEADERS)
    data = response.json()
    contributors = []
    for contributor in data['data']:
        contributors.append(contributor['embeds']['users']['data']['attributes']['full_name'])
    return contributors
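get_contributors indexes straight into the nested response, so a single entry without an embedded user record would raise a KeyError. The following is a minimal sketch of a more forgiving extraction step that works on an already parsed response. It assumes the same nesting used above (embeds, users, data, attributes, full_name); `extract_full_names` is a hypothetical helper name.

```python
def extract_full_names(payload):
    """Collect contributor names, skipping entries without an embedded user."""
    names = []
    for contributor in payload.get("data", []):
        # Fall back to an empty dict when any level of the nesting is missing
        user = contributor.get("embeds", {}).get("users", {}).get("data") or {}
        full_name = user.get("attributes", {}).get("full_name")
        if full_name:
            names.append(full_name)
    return names
```

Entries that lack a name are skipped silently here; whether to skip them or record a placeholder is a design choice that depends on how the author list will be used.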
Function to Get PDF URL#
This function will be used by the process_preprints function below to find the URL of the PDF from the preprint metadata.
def get_pdf_url(files_url):
    if files_url is None:
        return None
    response = requests.get(files_url + '/versions/', headers=HEADERS)
    data = response.json()
    for file in data['data']:
        if file['links'].get('download'):
            return file['links']['download']
Processing Preprints#
The following function processes the preprints metadata and downloads the PDFs for preprints that have a CC-BY 4.0 license.
def process_preprints(preprints, subject):
    os.makedirs(f"{subject}_pdfs", exist_ok=True)
    metadata_list = []
    for preprint in preprints:
        title = None
        date = None
        doi = None
        reviewed_doi = None
        authors = None
        pdf_url = None
        if preprint.get('attributes', {}).get('title'):
            title = preprint['attributes']['title']
        if preprint.get('attributes', {}).get('date_published'):
            date = preprint['attributes']['date_published']
        # Note: in the sample output below, attributes['doi'] holds a journal
        # article DOI (or None), while links['preprint_doi'] holds the
        # osf.io DOI link of the preprint itself
        if preprint.get('attributes', {}).get('doi'):
            doi = preprint['attributes']['doi']
        if preprint.get('links', {}).get('preprint_doi'):
            reviewed_doi = preprint['links']['preprint_doi']
        if preprint.get('relationships', {}).get('contributors', {}).get('links', {}).get('related'):
            authors = get_contributors(preprint['relationships']['contributors']['links']['related']['href'])
        if preprint.get('relationships', {}).get('primary_file', {}).get('links', {}).get('related'):
            pdf_url = get_pdf_url(preprint['relationships']['primary_file']['links']['related']['href'])
        license = ccby4_ids[preprint['relationships']['license']['data']['id']]
        metadata = {
            'title': title,
            'date': date,
            'doi': doi,
            'peer_reviewed_doi': reviewed_doi,
            'authors': authors,
            'pdf_url': pdf_url,
            'license': license
        }
        metadata_list.append(metadata)
        # Don't download the PDF if no URL is available or the license isn't regular CC-BY 4.0
        if (metadata['pdf_url'] is None or
                metadata['license'] != 'CC-By Attribution 4.0 International'):
            continue
        # Download PDF
        pdf_response = requests.get(metadata['pdf_url'], headers=HEADERS)
        if metadata['doi']:
            pdf_filename = f"{subject}_pdfs/{metadata['doi'].replace('/', '_').replace('?', '')}.pdf"
        else:
            pdf_filename = f"{subject}_pdfs/{metadata['pdf_url'].split('/')[-2]}.pdf"
        with open(pdf_filename, 'wb') as f:
            f.write(pdf_response.content)
    return metadata_list
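process_preprints builds each PDF filename from the DOI, replacing '/' and dropping '?', and falls back to a segment of the download URL when there is no DOI. The following is a minimal sketch of the same idea as a separate helper that replaces every character outside a conservative allow-list. `build_pdf_path` is a hypothetical name, and the allow-list is an assumption about which characters are safe on common file systems.

```python
import re


def build_pdf_path(folder, doi=None, pdf_url=None):
    """Build a local PDF path from the DOI, or from the URL when no DOI exists."""
    if doi:
        stem = doi
    else:
        # Same choice as process_preprints: the second-to-last URL segment
        stem = pdf_url.split("/")[-2]
    # Replace anything other than letters, digits, '.', '-' and '_'
    safe_stem = re.sub(r"[^A-Za-z0-9._-]", "_", stem)
    return f"{folder}/{safe_stem}.pdf"
```

For example, `build_pdf_path("Education_pdfs", doi="10.31014/aior.1992.07.04.628")` returns `Education_pdfs/10.31014_aior.1992.07.04.628.pdf`.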
Example Usage#
Fetch metadata and download PDFs for the preprints of the subject “Education”.
# Note that this code block might take a few minutes to fully run
metadata_list = process_preprints(ccby4_metadata, 'Education')
df = pd.DataFrame(metadata_list)
df.to_csv('preprints_metadata.csv', index=False)
df.head()
| | title | date | doi | peer_reviewed_doi | authors | pdf_url | license |
|---|---|---|---|---|---|---|---|
| 0 | asc2csv: A Python Package for Eye-Tracking Dat... | 2024-12-02T20:51:29.311288 | None | https://doi.org/10.31219/osf.io/vfpy5 | [Mohammad Ahsan Khodami] | https://osf.io/download/674de0298bf0df47ff9b6a... | CC-By Attribution 4.0 International |
| 1 | Educational Orientation, Actively Open-Minded ... | 2024-12-02T20:42:42.655100 | None | https://doi.org/10.31219/osf.io/ukhc7 | [Thomas Nygren, Maria Rasmusson, Malin Tväråna... | https://osf.io/download/674db6497eb1fc62e36fad... | CC-By Attribution 4.0 International |
| 2 | Organizational Agility: Does it Play a Role in... | 2024-12-02T18:49:08.337363 | 10.31014/aior.1992.07.04.628 | https://doi.org/10.31219/osf.io/cb5k7 | [Sabbena Nthenya Kivindo, Stephen M.A. Muathe,... | https://osf.io/download/674ac32019576d09905606... | CC-By Attribution 4.0 International |
| 3 | Demographic Factors and Turnover Intentions am... | 2024-12-02T18:47:34.595032 | 10.31014/aior.1992.07.04.629 | https://doi.org/10.31219/osf.io/54s7k | [Dhruba Prasad Subedi, Dilli Ram Bhandari] | https://osf.io/download/674ac1976a3d9cc3025605... | CC-By Attribution 4.0 International |
| 4 | Investigating the Predictors of Entrepreneuria... | 2024-12-02T18:46:15.221707 | 10.31014/aior.1992.07.04.630 | https://doi.org/10.31219/osf.io/fzacq | [Ana Mariana, Bram Hadianto] | https://osf.io/download/674abb3a39e47525385607... | CC-By Attribution 4.0 International |
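The authors column holds Python lists, and to_csv writes each list as its string form (for example "['Ana Mariana', 'Bram Hadianto']"), so reading the CSV back yields strings, not lists. The following is a minimal sketch of converting such a cell back into a list with the standard library; `parse_authors` is a hypothetical helper name.

```python
import ast


def parse_authors(cell):
    """Turn a CSV cell such as "['A', 'B']" back into a list of names."""
    if not isinstance(cell, str) or not cell.strip():
        # Missing values (empty cells, NaN) become an empty list
        return []
    # literal_eval only parses Python literals; it does not execute code
    return ast.literal_eval(cell)
```

After reloading the file with `pd.read_csv('preprints_metadata.csv')`, the helper could be applied with `df['authors'].apply(parse_authors)`.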
3. Batch Processing for Multiple Subjects#
This example demonstrates how the functions above can be used to retrieve the data and PDFs for multiple subjects.
subjects = [
"Education",
"Social and Behavioral Sciences"
]
# Note that this code block may take a few minutes to run
for subject in subjects:
preprints = fetch_preprints_metadata(subject)
metadata_list = process_preprints(preprints, subject)
df = pd.DataFrame(metadata_list)
df.to_csv(f"{subject}.csv", index=False)
print(f"Saved {len(metadata_list)} preprints for {subject}")
Saved 68 preprints for Education
Saved 76 preprints for Social and Behavioral Sciences
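Batch runs like this one issue many requests, so they are the most likely to hit the rate limits noted in the introduction. The following is a minimal sketch of retrying a throttled request after a pause. It assumes throttled requests come back with HTTP status 429, which should be checked against the Error Codes section of the documentation; `get_with_retry`, `max_retries`, and `wait_seconds` are hypothetical names, and `get` stands in for any function with the call shape of `requests.get`.

```python
from time import sleep


def get_with_retry(get, url, max_retries=3, wait_seconds=1.0, **kwargs):
    """Call get(url, **kwargs), retrying while the status code is 429.

    `get` is any callable shaped like requests.get. Assumption: the API
    signals throttling with HTTP 429 (Too Many Requests).
    """
    response = get(url, **kwargs)
    attempts = 0
    while response.status_code == 429 and attempts < max_retries:
        attempts += 1
        # Wait a little longer on each retry before asking again
        sleep(wait_seconds * attempts)
        response = get(url, **kwargs)
    return response
```

A call such as `get_with_retry(requests.get, url, headers=HEADERS)` could then stand in for the direct `requests.get` calls in the functions above. If every retry is also throttled, the last 429 response is returned so the caller can decide what to do.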