Open Science Framework (OSF) API in Python

by Avery Fernandez and Michael T. Moen

The OSF API allows users to fetch metadata and files from the OSF platform. This cookbook will guide you through the setup and usage of the API, including fetching metadata for preprints and downloading PDFs.

Please see the following resources for more information on API usage:

- Documentation
  - OSF API Documentation
- Terms of Use
  - OSF API Terms of Use

NOTE: Without an access token, the OSF Preprints API limits requests to a maximum of 100 per hour. With an access token, the limit is 10,000 requests per day. See the Error Codes section of the documentation for more information.
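To put these limits in perspective, the sketch below converts each one into the average spacing between requests that would keep a long-running script within it. The function name min_delay_seconds is hypothetical. The calculation assumes requests are spread evenly over the window, which is a simplification and not something the API requires.

```python
def min_delay_seconds(max_requests: int, window_seconds: int) -> float:
    """Average pause between requests that stays within a request budget."""
    return window_seconds / max_requests

# 100 requests per hour without an access token
unauthenticated_delay = min_delay_seconds(100, 60 * 60)

# 10,000 requests per day with an access token
authenticated_delay = min_delay_seconds(10_000, 24 * 60 * 60)

print(unauthenticated_delay)  # 36.0
print(authenticated_delay)    # 8.64
```

Short bursts can be faster than these averages as long as the total stays under the limit for the window.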

These recipe examples were tested on December 3, 2024.

Setup#

Import Libraries#

The following external libraries need to be installed into your environment to run the code examples in this tutorial:

- requests
- python-dotenv
- pandas

We import the libraries used in this tutorial below:

import os
import requests
from dotenv import load_dotenv
import pandas as pd
from time import sleep

Import Access Token#

Authentication is not required to access the OSF API, but supplying a personal access token increases your rate limit. You can create a token from the settings of your OSF account.

We keep our API token in a .env file and use the dotenv library to load it. If you would like to use this method, create a file named .env in the same directory as this notebook and add the following line to it:

OSF_API_TOKEN="add-your-api-token-here"

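load_dotenv()
try:
    API_TOKEN = os.environ["OSF_API_TOKEN"]
except KeyError:
    # Fall back to unauthenticated requests, which have a lower rate limit
    API_TOKEN = None
    print("API token not found. Please set 'OSF_API_TOKEN' in your .env file.")

The OSF API expects the token to be passed as a bearer token in the Authorization header:

# Only send the Authorization header when a token is available
HEADERS = {'Authorization': f'Bearer {API_TOKEN}'} if API_TOKEN else {}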

1. Fetching CC-BY 4.0 License Info#

Using the licenses endpoint, we can find data relating to various licenses. In this example, we limit our search to CC-BY 4.0 licenses.

url = 'https://api.osf.io/v2/licenses?filter[name]=cc-by&filter[name]=4.0'
response = requests.get(url, headers=HEADERS)
data = response.json()
for license in data['data']:
    print(license['attributes']['name'])
    print(license['attributes']['url'], '\n')

CC-By Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode 

CC-BY Attribution-No Derivatives 4.0 International
https://creativecommons.org/licenses/by-nd/4.0/legalcode 

CC-BY Attribution-NonCommercial 4.0 International
https://creativecommons.org/licenses/by-nc/4.0/legalcode 

CC-BY Attribution-NonCommercial-ShareAlike 4.0 International
https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode 
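The request above repeats the filter[name] parameter, once per term. As a minimal sketch of how such a query string is assembled, the block below builds the same URL from a list of key-value pairs using only the standard library. The variable names are illustrative and the block makes no request. The percent-encoded brackets it prints decode to the literal brackets in the hand-written URL.

```python
from urllib.parse import urlencode, parse_qs

licenses_base_url = 'https://api.osf.io/v2/licenses'

# A list of pairs (rather than a dict) allows the same key to appear twice
name_filters = [('filter[name]', 'cc-by'), ('filter[name]', '4.0')]

query_string = urlencode(name_filters)
licenses_url = f'{licenses_base_url}?{query_string}'

print(licenses_url)
# https://api.osf.io/v2/licenses?filter%5Bname%5D=cc-by&filter%5Bname%5D=4.0

# Decoding the query string recovers both values under the one key
print(parse_qs(query_string))
# {'filter[name]': ['cc-by', '4.0']}
```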

From the data returned, we can also retrieve the full text of the licenses.

# Output limited to the first 264 characters for demonstration purposes
print(data['data'][0]['attributes']['text'][:264])

Creative Commons Attribution 4.0 International Public License

By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution 4.0 International Public License ("Public License").

For the next example, we will create the ccby4_ids dictionary so that we can check the license of preprints when obtaining metadata.

ccby4_ids = {}
for license in data['data']:
    ccby4_ids[license['id']] = license['attributes']['name']

ccby4_ids

{'563c1cf88c5e4a3877f9e96a': 'CC-By Attribution 4.0 International',
 '60bf983b58510b0009a5a9a4': 'CC-BY Attribution-No Derivatives 4.0 International',
 '60bf992258510b0009a5a9a6': 'CC-BY Attribution-NonCommercial 4.0 International',
 '60bf99e058510b0009a5a9a9': 'CC-BY Attribution-NonCommercial-ShareAlike 4.0 International'}
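As a small offline sketch of how this lookup table is used, the block below rebuilds a dictionary of the same shape from a hand-written sample (two of the entries shown above) with a dictionary comprehension, then looks up a known ID and an unknown one. The names sample_response, license_names, and license_name_or_none are hypothetical and exist only for this illustration.

```python
# Hand-written sample mirroring the structure of the licenses response
sample_response = {
    'data': [
        {'id': '563c1cf88c5e4a3877f9e96a',
         'attributes': {'name': 'CC-By Attribution 4.0 International'}},
        {'id': '60bf983b58510b0009a5a9a4',
         'attributes': {'name': 'CC-BY Attribution-No Derivatives 4.0 International'}},
    ]
}

# Map each license ID to its human-readable name
license_names = {
    item['id']: item['attributes']['name'] for item in sample_response['data']
}

def license_name_or_none(license_id):
    """Return the license name for a known ID, or None for any other ID."""
    return license_names.get(license_id)

print(license_name_or_none('563c1cf88c5e4a3877f9e96a'))
# CC-By Attribution 4.0 International
print(license_name_or_none('not-a-known-id'))
# None
```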

2. Fetching Preprint Metadata and PDFs#

In this use case, we will fetch the metadata for preprints that fall under a specified subject and are licensed under CC-BY 4.0 using the preprints endpoint. The metadata includes titles, publication dates, DOIs, authors, and PDF URLs.

Function to Fetch Preprints Metadata#

This function retrieves the metadata of CC-BY 4.0 preprints for a given subject, using the ccby4_ids obtained in the previous example to determine whether a preprint is CC-BY 4.0. For the sake of demonstration, only the first 100 preprints returned by the API are examined in this example.

# Fetch the metadata of preprints for a subject, keeping only CC-BY 4.0 preprints.
# `limit` is the maximum number of result pages (100 preprints each) to request.
def fetch_preprints_metadata(subject: str, limit: int = 1):
    base_url = 'https://api.osf.io/v2/preprints'
    params = {
        'filter[subjects]': subject,
        'page[size]': 100
    }
    
    preprints = []
    url = base_url
    iteration = 0
    while url and iteration < limit:
        iteration += 1
        if url == base_url:
            response = requests.get(url, params=params, headers=HEADERS)
        else:
            response = requests.get(url, headers=HEADERS)
        sleep(1)
        data = response.json()

        # Check if the preprint licenses are CC-BY 4.0
        for preprint in data['data']:

            # Skip preprints that have no 'relationships' entry
            # or no 'license' entry within it
            if not preprint.get('relationships', {}).get('license'):
                continue
            
            if preprint['relationships']['license']['data']['id'] in ccby4_ids:
                preprints.append(preprint)

        url = data['links'].get('next')
    return preprints

# Retrieve the metadata for the CC-BY 4.0 preprints in the first 100 results
ccby4_metadata = fetch_preprints_metadata(subject='Education', limit=1)

# Print number of CC-BY 4.0 preprints found
len(ccby4_metadata)
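The pagination pattern used by fetch_preprints_metadata, following the next link until it is absent or a page limit is reached, can be isolated from the network code. The sketch below is one way to do that. The names iter_pages and fetch_json are hypothetical, and the two fake pages are made-up stand-ins for API responses so the block runs offline.

```python
from typing import Callable, Iterator, Optional

def iter_pages(first_url: str,
               fetch_json: Callable[[str], dict],
               max_pages: int = 1) -> Iterator[dict]:
    """Yield up to max_pages pages, following each page's links.next."""
    url: Optional[str] = first_url
    pages_seen = 0
    while url and pages_seen < max_pages:
        page = fetch_json(url)
        pages_seen += 1
        yield page
        # A missing or null 'next' link ends the loop
        url = page.get('links', {}).get('next')

# Made-up pages keyed by placeholder URLs, standing in for API responses
fake_pages = {
    'page-1': {'data': [{'id': 'a'}, {'id': 'b'}], 'links': {'next': 'page-2'}},
    'page-2': {'data': [{'id': 'c'}], 'links': {'next': None}},
}

ids = [
    item['id']
    for page in iter_pages('page-1', fake_pages.__getitem__, max_pages=5)
    for item in page['data']
]
print(ids)  # ['a', 'b', 'c']
```

With a real API, fetch_json could be a small function that calls requests.get with the headers defined earlier and returns response.json().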

Function to Get Contributors#

This function will be used by the process_preprints function below to find the contributors from the preprint metadata.

def get_contributors(contributors_url):
    if contributors_url is None:
        return []
    response = requests.get(contributors_url, headers=HEADERS)
    data = response.json()
    contributors = []
    for contributor in data['data']:
        contributors.append(contributor['embeds']['users']['data']['attributes']['full_name'])
    return contributors
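The function above indexes directly into the embedded user data, so a contributor entry without that data would raise a KeyError. The sketch below shows a more defensive extraction on a hand-written payload. It assumes that some entries might lack the embedded user, which is an assumption for illustration and not something confirmed here. The name extract_full_names is hypothetical.

```python
def extract_full_names(payload: dict) -> list:
    """Collect full names, skipping entries without embedded user data."""
    names = []
    for contributor in payload.get('data', []):
        user = contributor.get('embeds', {}).get('users', {}).get('data') or {}
        name = user.get('attributes', {}).get('full_name')
        if name:
            names.append(name)
    return names

# Hand-written payload: one complete entry and one without user data
sample_payload = {
    'data': [
        {'embeds': {'users': {'data': {'attributes': {'full_name': 'Ana Mariana'}}}}},
        {'embeds': {'users': {}}},
    ]
}

print(extract_full_names(sample_payload))  # ['Ana Mariana']
```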

Function to Get PDF URL#

This function will be used by the process_preprints function below to find the URL of the PDF from the preprint metadata.

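def get_pdf_url(files_url):
    if files_url is None:
        return None

    # Strip any trailing slash so the joined URL has no double slash
    response = requests.get(files_url.rstrip('/') + '/versions/', headers=HEADERS)
    data = response.json()

    # Return the download link of the first file version that has one
    for file in data['data']:
        if file['links'].get('download'):
            return file['links']['download']

    # No downloadable version was found
    return None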

Processing Preprints#

The following function collects the metadata for each preprint and downloads its PDF only when a download URL is available and the license is the plain CC-By Attribution 4.0 International license.

def process_preprints(preprints, subject):
    os.makedirs(f"{subject}_pdfs", exist_ok=True)

    metadata_list = []
    
    for preprint in preprints:
        title = preprint.get('attributes', {}).get('title')
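        date = preprint.get('attributes', {}).get('date_published')
        # 'doi' is the DOI of an associated published article, if there is one
        doi = preprint.get('attributes', {}).get('doi')
        # 'preprint_doi' is the DOI link of the OSF preprint itself
        reviewed_doi = preprint.get('links', {}).get('preprint_doi')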

        contributors_url = preprint.get('relationships', {}).get(
            'contributors', {}).get('links', {}).get('related', {}).get('href')
        authors = get_contributors(contributors_url)

        # Resolve the primary file's API URL to a direct download URL
        files_url = preprint.get('relationships', {}).get('primary_file', {}).get(
            'links', {}).get('related', {}).get('href')
        pdf_url = get_pdf_url(files_url)

        license = ccby4_ids[preprint['relationships']['license']['data']['id']]

        metadata = {
            'title': title,
            'date': date,
            'doi': doi,
            'peer_reviewed_doi': reviewed_doi,
            'authors': authors,
            'pdf_url': pdf_url,
            'license': license
        }
        metadata_list.append(metadata)

        # Don't download the PDF if no URL is available or the license isn't regular CC-BY 4.0
        if (metadata['pdf_url'] is None or
            metadata['license'] != 'CC-By Attribution 4.0 International'):
            continue

        # Download PDF
        pdf_response = requests.get(metadata['pdf_url'], headers=HEADERS)
        if doi:
            pdf_filename = f"{subject}_pdfs/{doi.replace('/', '_').replace('?', '')}.pdf"
        else:
            pdf_filename = f"{subject}_pdfs/{pdf_url.split('/')[-2]}.pdf"
        with open(pdf_filename, 'wb') as f:
            f.write(pdf_response.content)
    
    return metadata_list
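The filename logic above replaces slashes in the DOI and removes question marks. As an alternative sketch, and not the method used in this tutorial, the helper below keeps only letters, digits, dots, underscores, and hyphens, replacing every other run of characters with a single underscore. The name safe_pdf_name is hypothetical.

```python
import re

def safe_pdf_name(identifier: str) -> str:
    """Turn a DOI or file ID into a conservative PDF filename."""
    # Replace each run of characters outside the allowed set with one underscore
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', identifier).strip('_')
    return f'{cleaned}.pdf'

print(safe_pdf_name('10.31014/aior.1992.07.04.628'))
# 10.31014_aior.1992.07.04.628.pdf
print(safe_pdf_name('a/b?c=d'))
# a_b_c_d.pdf
```

An allow-list like this is stricter than strictly necessary, but it avoids having to enumerate every character that a particular filesystem rejects.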

Example Usage#

Fetch metadata and download PDFs for the preprints of the subject “Education”.

# Note that this code block might take a few minutes to fully run
metadata_list = process_preprints(ccby4_metadata, 'Education')
df = pd.DataFrame(metadata_list)
df.to_csv('preprints_metadata.csv', index=False)
df.head()

	title	date	doi	peer_reviewed_doi	authors	pdf_url	license
0	asc2csv: A Python Package for Eye-Tracking Dat...	2024-12-02T20:51:29.311288	None	https://doi.org/10.31219/osf.io/vfpy5	[Mohammad Ahsan Khodami]	https://osf.io/download/674de0298bf0df47ff9b6a...	CC-By Attribution 4.0 International
1	Educational Orientation, Actively Open-Minded ...	2024-12-02T20:42:42.655100	None	https://doi.org/10.31219/osf.io/ukhc7	[Thomas Nygren, Maria Rasmusson, Malin Tväråna...	https://osf.io/download/674db6497eb1fc62e36fad...	CC-By Attribution 4.0 International
2	Organizational Agility: Does it Play a Role in...	2024-12-02T18:49:08.337363	10.31014/aior.1992.07.04.628	https://doi.org/10.31219/osf.io/cb5k7	[Sabbena Nthenya Kivindo, Stephen M.A. Muathe,...	https://osf.io/download/674ac32019576d09905606...	CC-By Attribution 4.0 International
3	Demographic Factors and Turnover Intentions am...	2024-12-02T18:47:34.595032	10.31014/aior.1992.07.04.629	https://doi.org/10.31219/osf.io/54s7k	[Dhruba Prasad Subedi, Dilli Ram Bhandari]	https://osf.io/download/674ac1976a3d9cc3025605...	CC-By Attribution 4.0 International
4	Investigating the Predictors of Entrepreneuria...	2024-12-02T18:46:15.221707	10.31014/aior.1992.07.04.630	https://doi.org/10.31219/osf.io/fzacq	[Ana Mariana, Bram Hadianto]	https://osf.io/download/674abb3a39e47525385607...	CC-By Attribution 4.0 International
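The authors column holds Python lists, and a CSV cell can only store text, so each list is written as its string representation. If a plainer cell is preferred, one option is to join the names before building the DataFrame. The sketch below does this on hand-written records. The name flatten_authors and the record titles are made up for illustration.

```python
def flatten_authors(records, separator='; '):
    """Return copies of the records with 'authors' joined into one string."""
    flattened = []
    for record in records:
        row = dict(record)  # shallow copy so the original record is unchanged
        row['authors'] = separator.join(record.get('authors') or [])
        flattened.append(row)
    return flattened

# Hand-written records with the same 'authors' shape as metadata_list
sample_records = [
    {'title': 'Example preprint A', 'authors': ['Ana Mariana', 'Bram Hadianto']},
    {'title': 'Example preprint B', 'authors': []},
]

flat_records = flatten_authors(sample_records)
print(flat_records[0]['authors'])  # Ana Mariana; Bram Hadianto
```

The result can be passed to pd.DataFrame in place of the original list of records.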

3. Batch Processing for Multiple Subjects#

This example demonstrates how the functions above can be used to retrieve the data and PDFs for multiple subjects.

subjects = [
    "Education",
    "Social and Behavioral Sciences"
]

# Note that this code block may take a few minutes to run
for subject in subjects:
    preprints = fetch_preprints_metadata(subject)
    metadata_list = process_preprints(preprints, subject)
    df = pd.DataFrame(metadata_list)
    df.to_csv(f"{subject}.csv", index=False)
    print(f"Saved {len(metadata_list)} preprints for {subject}")

Saved 68 preprints for Education
Saved 76 preprints for Social and Behavioral Sciences
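In a longer batch, one failing subject (for example, a network error partway through) would stop the whole loop. A minimal sketch of isolating failures per subject is shown below. The names run_batch and fake_handler are hypothetical, and the handler simulates work offline instead of calling the API.

```python
def run_batch(subjects, handler):
    """Apply handler to each subject, recording failures without stopping."""
    results, failures = {}, {}
    for subject in subjects:
        try:
            results[subject] = handler(subject)
        except Exception as error:
            # Record the error and continue with the next subject
            failures[subject] = repr(error)
    return results, failures

def fake_handler(subject):
    """Stand-in for the fetch-and-process step; fails for one subject."""
    if subject == 'Broken Subject':
        raise ValueError('simulated failure')
    return len(subject)

results, failures = run_batch(['Education', 'Broken Subject'], fake_handler)
print(results)   # {'Education': 9}
print(failures)  # {'Broken Subject': "ValueError('simulated failure')"}
```

In the tutorial's setting, the handler could wrap the calls to fetch_preprints_metadata and process_preprints for a single subject.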