Lenon Oliveira

Downloading and caching large files using Python

While writing a small Python library to download and parse a large CSV file from the web, I had to implement a strategy to cache the file locally and avoid downloading it on every execution. I wanted the library to download the file on the first execution and then download it again only when it has changed on the server. In this blog post I describe how to implement this with Python, basic HTTP headers and some file manipulation.

Let’s start by looking at the two HTTP headers used to control this kind of caching:

- Last-Modified: a response header with the date and time at which the server considers the file to have last changed.
- If-Modified-Since: a request header that makes the request conditional: the server only sends the file back if it has changed since the given date, otherwise it answers with 304 Not Modified and an empty body.

So I can store the value of Last-Modified and send it in the next HTTP request as the value for If-Modified-Since. The server will return a 200 status and a body only if the file has been modified.
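
To make the idea concrete, here is a rough sketch of that exchange using requests (the URL is just a placeholder, and the server is assumed to include a Last-Modified header in its response):

import requests

url = "https://example.com/data.csv"  # placeholder URL, for illustration only

# First request: a plain GET. The response carries the file plus the
# Last-Modified header.
first = requests.get(url)
last_modified = first.headers["Last-Modified"]

# Next request: send that value back as If-Modified-Since.
second = requests.get(url, headers={"If-Modified-Since": last_modified})

# 304 and no body if the file is unchanged, 200 with a fresh body otherwise.
print(second.status_code)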

Both headers contain a timestamp in the format defined by RFC 7231 as an “HTTP-date” and Python has functions to handle this format. They can be imported from the module email.utils and, despite the module name, are compatible with the HTTP standard. Here’s an example:

from datetime import datetime
from email.utils import parsedate_to_datetime, formatdate

formatdate(datetime.now().timestamp(), usegmt=True)
# 'Sat, 22 May 2021 03:08:49 GMT'

parsedate_to_datetime('Sat, 22 May 2021 03:08:49 GMT')
# datetime.datetime(2021, 5, 22, 3, 8, 49, tzinfo=datetime.timezone.utc)

It is important to pass usegmt=True to formatdate because HTTP dates are always expressed in GMT.
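
For comparison, formatting a fixed timestamp with and without usegmt shows the difference: without it, formatdate uses a numeric -0000 offset instead of GMT, which is not the form HTTP expects:

from email.utils import formatdate

formatdate(1621653735.0, usegmt=True)
# 'Sat, 22 May 2021 03:22:15 GMT'

formatdate(1621653735.0)
# 'Sat, 22 May 2021 03:22:15 -0000'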

I can use the file’s modification time (mtime) to store the modification time indicated by the Last-Modified header of the HTTP response. Python has os.path.getmtime to read a file’s modification time and os.utime to change it:

import os
from datetime import datetime

# times is an (atime, mtime) pair: the access time is set to "now" and the
# modification time to a fixed Unix timestamp.
os.utime("hello.txt", times=(datetime.now().timestamp(), 1621653735.0))
os.path.getmtime("hello.txt")
# 1621653735.0

Now, let’s see the actual download function. I’ve used requests to make the HTTP request as follows:

import os
import requests
from datetime import datetime
from email.utils import parsedate_to_datetime, formatdate

def download(url, destination_file):
    headers = {}

    if os.path.exists(destination_file):
        mtime = os.path.getmtime(destination_file)
        headers["if-modified-since"] = formatdate(mtime, usegmt=True)

    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()

    if response.status_code == requests.codes.not_modified:
        return

    if response.status_code == requests.codes.ok:
        with open(destination_file, "wb") as f:
            for chunk in response.iter_content(chunk_size=1048576):
                f.write(chunk)

        if last_modified := response.headers.get("last-modified"):
            new_mtime = parsedate_to_datetime(last_modified).timestamp()
            os.utime(destination_file, times=(datetime.now().timestamp(), new_mtime))

Here are some important parts of this function:

- If the destination file already exists, its modification time is sent in the If-Modified-Since header, so the server only sends the file again if it has changed since then.
- raise_for_status raises an exception on 4xx and 5xx responses, so failures don’t pass silently.
- If the server answers 304 Not Modified, the function returns early and leaves the local copy untouched.
- stream=True together with iter_content writes the body to disk in 1 MiB chunks instead of loading the whole file into memory, which matters for large downloads.
- After a successful download, os.utime stores the Last-Modified value of the response as the file’s modification time, so the next call can send it back as If-Modified-Since.

The download function can be used like this:

dataset_url = "https://www.tesourotransparente.gov.br/ckan/dataset/df56aa42-484a-4a59-8184-7676580c81e3/resource/796d2059-14e9-44e3-80c9-2d9e30b405c1/download/PrecoTaxaTesouroDireto.csv"

download(dataset_url, "dataset.csv")

Calling it multiple times will update the dataset only if it has changed on the server, as expected. Depending on the situation, it would be good to also implement a check on the Cache-Control header, but for now this is good enough.
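
For reference, a Cache-Control check could look roughly like the sketch below: parse max-age out of the response header and skip the request entirely while the local copy is still fresh. The helper names are hypothetical, and a real implementation would also have to handle directives such as no-cache and no-store:

import time

def parse_max_age(cache_control):
    # Extract max-age (in seconds) from a Cache-Control header value,
    # e.g. "public, max-age=3600" -> 3600. Returns None if absent.
    for directive in (cache_control or "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None

def is_fresh(downloaded_at, max_age):
    # The copy is fresh while it is younger than max-age seconds.
    # downloaded_at is the Unix timestamp of the last successful download;
    # it would have to be stored separately, because the file's mtime is
    # already used to hold the server's Last-Modified value.
    if max_age is None:
        return False
    return time.time() - downloaded_at < max_age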

#Python #Http