Sage Journals Text and Data Mining in Python

by Michael T. Moen

Sage Journals allows downloading of articles to which you have legitimate access (e.g., open access articles and those included in your institution’s subscription) for non-commercial text and data mining (see the restrictions in the terms below). Text and data mining with Sage resources requires prior approval. Contact UA Libraries or your institution to check their agreement and enable access. See the resources below for more information on Sage text and data mining, API usage, and policies/terms.

This tutorial content is intended to help facilitate academic research. Please check with your institution about its Text and Data Mining or related license agreement with Sage Journals.

Documentation
- Sage Journals
Terms
Data Reuse
- Sage Policy on Text and Data Mining (TDM) and Artificial Intelligence (AI)

This recipe uses the CrossRef API to obtain the full-text URLs of the articles, as recommended in Sage’s Text and Data Mining overview. For more information on usage for this API, please see our CrossRef cookbook tutorials and the text and data mining for researchers page of CrossRef’s API documentation.

NOTE: Sage Journals limits downloads to a maximum of 1 request every 6 seconds Monday to Friday from midnight to noon in the “America/Los_Angeles” timezone, and 1 request every 2 seconds outside of this time slot.
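
The schedule in the note above can be turned into a small helper that picks the wait time between requests. This is a minimal sketch and not part of the original recipe: the names `sage_delay_seconds` and `current_sage_delay` are hypothetical, and it assumes that "midnight to noon" means 00:00 through 11:59 local time.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

PEAK_DELAY = 6      # seconds between requests, Mon-Fri 00:00-11:59 Los Angeles time
OFF_PEAK_DELAY = 2  # seconds between requests at all other times

def sage_delay_seconds(la_time: datetime) -> int:
    """Return the minimum delay for a datetime already expressed in Los Angeles time."""
    is_weekday = la_time.weekday() < 5   # Monday is 0, Friday is 4
    is_morning = la_time.hour < 12       # assumption: noon itself counts as off-peak
    return PEAK_DELAY if (is_weekday and is_morning) else OFF_PEAK_DELAY

def current_sage_delay() -> int:
    """Return the minimum delay that applies right now."""
    return sage_delay_seconds(datetime.now(ZoneInfo("America/Los_Angeles")))
```

A loop could then call `sleep(current_sage_delay())` between downloads instead of using a fixed value.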

These recipe examples were tested on October 14, 2025.

Setup#

Import Libraries#

The following external libraries need to be installed into your environment to run the code examples in this tutorial: requests and python-dotenv (imported as dotenv).

We import the libraries used in this tutorial below:

import requests
from dotenv import load_dotenv
import os
from time import sleep

Import Email#

CrossRef asks users to identify themselves by providing an email address in API requests, passed here through the mailto query parameter.

We keep our email address in a .env file and use the dotenv library to access it. If you would like to use this method, create a .env file and add the following line to it:

EMAIL=PUT_YOUR_EMAIL_HERE

load_dotenv()
try:
    EMAIL = os.environ["EMAIL"]
except KeyError:
    print("EMAIL not found. Please set 'EMAIL' in your .env file.")
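
To illustrate how the email address ends up in a request, the sketch below builds the CrossRef works URL used later in this tutorial. The helper name `build_crossref_url` is hypothetical, and the address `name@example.org` is a placeholder. It accepts either a bare DOI or a doi.org link and percent-encodes the pieces with the standard library.

```python
from urllib.parse import quote, urlencode

CROSSREF_WORKS = "https://api.crossref.org/works/"

def build_crossref_url(doi: str, email: str) -> str:
    """Build a CrossRef works URL for a DOI, identifying the caller by email."""
    # Accept either a bare DOI or a doi.org link
    for prefix in ("https://doi.org/", "http://doi.org/"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Percent-encode unusual characters in the DOI but keep the slash separators
    path = quote(doi, safe="/")
    return f"{CROSSREF_WORKS}{path}?{urlencode({'mailto': email})}"

print(build_crossref_url("https://doi.org/10.1177/14759217221075241", "name@example.org"))
```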

Enable Text and Data Mining with Sage#

Access to text and data mining on Sage requires approval. Contact UA Libraries or your institution to check their agreement and enable access.

1. Retrieve a Full-Text Article as a PDF#

To begin, let’s consider a simple example where we retrieve the full-text of an article.

For this example, we look at the following article licensed under CC BY 4.0:

https://doi.org/10.1177/14759217221075241

Sage permits non-commercial TDM for articles to which you have legitimate access. If you can view the full text of the article at the DOI above in your browser, you should be able to access it programmatically below once you receive approval from Sage.

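from typing import Optional

def get_pdf_url(doi: str) -> Optional[str]:
    """Use the CrossRef API to obtain the PDF TDM link for the given DOI"""
    data = requests.get(f'https://api.crossref.org/works/{doi}?mailto={EMAIL}').json()
    # Some records have no 'link' entry, so fall back to an empty list
    for link in data['message'].get('link', []):
        if (link['content-type'] == 'application/pdf' and
            link['intended-application'] == 'text-mining'):
            return link['URL']
    # No PDF text-mining link was listed for this DOI
    return None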

doi = 'https://doi.org/10.1177/14759217221075241'
full_text_url = get_pdf_url(doi)
full_text_url

'https://journals.sagepub.com/doi/pdf/10.1177/14759217221075241'
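
To make the filtering step easier to follow, here is a minimal sketch that runs the same selection logic on a hand-written dictionary instead of a live API call. The helper name `find_tdm_link` is hypothetical, and `sample_message` only mimics the three fields used in this tutorial; it is not real API output.

```python
from typing import Optional

def find_tdm_link(message: dict, content_type: str) -> Optional[str]:
    """Return the first text-mining link of the given content type, or None."""
    for link in message.get("link", []):   # tolerate records without a 'link' list
        if (link.get("content-type") == content_type
                and link.get("intended-application") == "text-mining"):
            return link.get("URL")
    return None

# Hand-written sample shaped like the fields used above; not real API output
sample_message = {
    "link": [
        {"URL": "https://journals.sagepub.com/doi/pdf/10.1177/14759217221075241",
         "content-type": "application/pdf",
         "intended-application": "text-mining"},
        {"URL": "https://journals.sagepub.com/doi/full-xml/10.1177/14759217221075241",
         "content-type": "application/xml",
         "intended-application": "text-mining"},
    ]
}

print(find_tdm_link(sample_message, "application/pdf"))
print(find_tdm_link({}, "application/pdf"))
```

Returning `None` when nothing matches lets the caller decide how to report a DOI that has no text-mining link.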

With the URL for the article full text, we can now retrieve the data from Sage.

def get_article_full_text(url : str):
    """Retrieve the full-text of an article from Sage"""
    response = requests.get(url)
    if response.status_code == 200:
        # Status code 200 indicates success
        if 'https://journals.sagepub.com/doi/abs/' in response.url:
            # If you do not have access to an article, your query will redirect to the abstract
            print('ERROR: You do not appear to have access to the full-text for this article.')
        else:
            return response
    elif response.status_code == 403:
        # Status code 403 indicates that the requested object is forbidden
        print('ERROR: Access to TDM on Sage requires approval.')
        print('Contact UA Libraries or your institution for more guidance.')
    else:
        print(f'ERROR: {response.status_code}')
    return None

response = get_article_full_text(full_text_url)

Since our query was successful, we download the article full-text as a PDF below:

def download_pdf(response : requests.models.Response, filename : str) -> None:
    """Download the full-text for an article"""
    with open(filename, 'wb') as f:
        f.write(response.content)

download_pdf(response, "article.pdf")
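
Because a request can succeed and still return something other than the article (for example, an HTML page), it can be useful to confirm that the bytes look like a PDF before relying on the saved file. The sketch below is an optional extra check, not part of the original recipe; the helper name `looks_like_pdf` is hypothetical, and it relies only on the fact that PDF files begin with the `%PDF-` signature.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Check for the '%PDF-' signature that opens a PDF file."""
    return content.startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.7\n..."))      # a PDF header
print(looks_like_pdf(b"<!DOCTYPE html>"))    # an HTML page
```

With the functions above, this check could be applied as `looks_like_pdf(response.content)` before calling `download_pdf`.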

2. Retrieve Full-Text PDF Articles in a Loop#

Using the functions defined in the previous example, we can retrieve the full-text of several articles in a loop.

# These articles are licensed under CC BY 4.0: https://creativecommons.org/licenses/by/4.0/
dois = [
    'https://doi.org/10.3233/NAI-240767',
    'https://doi.org/10.1177/20539517221145372',
    'https://doi.org/10.1177/09544062231164575',
    'https://doi.org/10.1177/2053951717743530',
    'https://doi.org/10.1177/00405175221145571'
]

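for idx, doi in enumerate(dois):
    url = get_pdf_url(doi)
    if not url:
        print(f'ERROR: No PDF text-mining link found for {doi}')
        continue
    response = get_article_full_text(url)
    # Wait 6 seconds between requests to stay within Sage's strictest rate limit
    sleep(6)
    if not response:
        print(f'ERROR: Could not download {url}')
        continue
    filename = f'article{idx+1}.pdf'
    download_pdf(response, filename)
    print(f'{url} successfully downloaded as {filename}')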

https://journals.sagepub.com/doi/pdf/10.3233/NAI-240767 successfully downloaded as article1.pdf
https://journals.sagepub.com/doi/pdf/10.1177/20539517221145372 successfully downloaded as article2.pdf
https://journals.sagepub.com/doi/pdf/10.1177/09544062231164575 successfully downloaded as article3.pdf
http://journals.sagepub.com/doi/pdf/10.1177/2053951717743530 successfully downloaded as article4.pdf
https://journals.sagepub.com/doi/pdf/10.1177/00405175221145571 successfully downloaded as article5.pdf
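
The loop pattern above (fetch, wait, skip failures) can also be written once and reused for other batches. The following is a minimal sketch under stated assumptions: `fetch_all` is a hypothetical helper, the `fetch` and `pause` arguments are stand-ins that the caller supplies, and the default delay of 6 seconds is chosen to match the strictest limit in the note at the top of this tutorial.

```python
from time import sleep
from typing import Callable, Iterable, Optional

def fetch_all(items: Iterable[str],
              fetch: Callable[[str], Optional[bytes]],
              delay: float = 6.0,
              pause: Callable[[float], None] = sleep) -> dict:
    """Call fetch(item) for each item, pausing `delay` seconds between calls.

    Returns a dict mapping each item to its result, or None when fetch failed.
    """
    results = {}
    for position, item in enumerate(items):
        if position > 0:
            pause(delay)          # wait before every request except the first
        try:
            results[item] = fetch(item)
        except Exception as exc:  # keep going so one bad item does not end the run
            print(f"ERROR: {item}: {exc}")
            results[item] = None
    return results
```

Passing `pause` as an argument keeps the helper easy to try out without real waiting; in normal use the default `sleep` applies.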

3. Retrieve a Full-Text Article as XML#

This example uses the same article as section 1, retrieving the data as XML rather than a PDF.

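from typing import Optional

def get_xml_url(doi: str) -> Optional[str]:
    """Use the CrossRef API to obtain the XML TDM link for the given DOI"""
    data = requests.get(f'https://api.crossref.org/works/{doi}?mailto={EMAIL}').json()
    # Some records have no 'link' entry, so fall back to an empty list
    for link in data['message'].get('link', []):
        if (link['content-type'] == 'application/xml' and
            link['intended-application'] == 'text-mining'):
            return link['URL']
    # No XML text-mining link was listed for this DOI
    return None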

doi = 'https://doi.org/10.1177/14759217221075241'
full_text_url = get_xml_url(doi)
full_text_url

'https://journals.sagepub.com/doi/full-xml/10.1177/14759217221075241'

response = get_article_full_text(full_text_url)

def download_xml(response : requests.models.Response, filename : str) -> None:
    """Download the full-text for an article"""
    with open(filename, 'wb') as f:
        f.write(response.content)

download_xml(response, "article.xml")
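
Once the XML is saved, the text can be pulled out with the standard library. The sketch below is illustrative only: `sample_xml` is a hand-written stand-in, the helper name `extract_text` is hypothetical, and the tag names (`article-title`, `body`, `p`) are an assumption based on JATS-style markup. The actual Sage files may use a different layout, namespaces, or a DOCTYPE that needs extra handling.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a downloaded file; real Sage XML may be structured differently
sample_xml = b"""<article>
  <front><article-meta><title-group>
    <article-title>An Example Title</article-title>
  </title-group></article-meta></front>
  <body><sec><p>First paragraph.</p><p>Second <italic>styled</italic> paragraph.</p></sec></body>
</article>"""

def extract_text(xml_bytes: bytes) -> dict:
    """Pull the title and body paragraphs out of a JATS-style XML document."""
    root = ET.fromstring(xml_bytes)
    title = root.find(".//article-title")
    # itertext() joins the text of nested inline tags such as <italic>
    paragraphs = ["".join(p.itertext()).strip() for p in root.iterfind(".//body//p")]
    return {
        "title": "".join(title.itertext()).strip() if title is not None else None,
        "paragraphs": paragraphs,
    }

print(extract_text(sample_xml))
```

For a downloaded file, the bytes could be read with `open("article.xml", "rb").read()` and passed to the same function.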

4. Retrieve Full-Text XML Articles in a Loop#

This example uses the same articles from section 2, retrieving the data as XML rather than PDFs.

# We use the same list of DOIs from section 2
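for idx, doi in enumerate(dois):
    url = get_xml_url(doi)
    if not url:
        print(f'ERROR: No XML text-mining link found for {doi}')
        continue
    response = get_article_full_text(url)
    # Wait 6 seconds between requests to stay within Sage's strictest rate limit
    sleep(6)
    if not response:
        print(f'ERROR: Could not download {url}')
        continue
    filename = f'article{idx+1}.xml'
    download_xml(response, filename)
    print(f'{url} successfully downloaded as {filename}')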

https://journals.sagepub.com/doi/full-xml/10.3233/NAI-240767 successfully downloaded as article1.xml
https://journals.sagepub.com/doi/full-xml/10.1177/20539517221145372 successfully downloaded as article2.xml
https://journals.sagepub.com/doi/full-xml/10.1177/09544062231164575 successfully downloaded as article3.xml
http://journals.sagepub.com/doi/full-xml/10.1177/2053951717743530 successfully downloaded as article4.xml
https://journals.sagepub.com/doi/full-xml/10.1177/00405175221145571 successfully downloaded as article5.xml