Downloading and caching large files using Python
While writing a small Python library to download and parse a large CSV file from the web, I had to implement a strategy to cache the file locally and avoid downloading it on every execution. I wanted the library to download the file on the first execution and again only when it has changed on the server. In this post I describe how to implement this with Python, basic HTTP headers and file manipulation.
Let’s start by looking at these two HTTP headers used to control cache:
Last-Modified: according to MDN, this is a response header sent by HTTP servers and it “contains the date and time at which the origin server believes the resource was last modified”.
If-Modified-Since: also according to MDN, this is a request header and it “makes the request conditional: the server will send back the requested resource, with a 200 status, only if it has been last modified after the given date. If the resource has not been modified since, the response will be a 304 without any body”.
So I can store the value of Last-Modified and send it in the next HTTP request as the value for If-Modified-Since. The server will return a 200 status and a body only if the file has been modified.
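To see this exchange without touching a real server, here is a minimal sketch that serves a temporary file with Python's built-in http.server (whose SimpleHTTPRequestHandler honours If-Modified-Since in Python 3.7 and later) and requests it twice. The names QuietHandler, fetch and demo, as well as the file data.csv, are made up for this illustration.

```python
import http.client
import os
import tempfile
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


def fetch(port, path, headers=None):
    """Return (status, Last-Modified header, body) for a GET on localhost."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path, headers=headers or {})
        response = conn.getresponse()
        body = response.read()
        return response.status, response.getheader("Last-Modified"), body
    finally:
        conn.close()


def demo():
    with tempfile.TemporaryDirectory() as directory:
        # Hypothetical file standing in for the large CSV
        with open(os.path.join(directory, "data.csv"), "wb") as f:
            f.write(b"a,b\n1,2\n")

        handler = partial(QuietHandler, directory=directory)
        server = ThreadingHTTPServer(("127.0.0.1", 0), handler)  # port 0: any free port
        port = server.server_address[1]
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            # First request: unconditional, the server sends the file
            first = fetch(port, "/data.csv")
            # Second request: echo Last-Modified back as If-Modified-Since
            second = fetch(port, "/data.csv", {"If-Modified-Since": first[1]})
        finally:
            server.shutdown()
            server.server_close()
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print(first[0], len(first[2]))    # status and body size of the first response
    print(second[0], len(second[2]))  # status and body size of the second response
```

The first response should carry the file with a 200 status, and the second should be a 304 with an empty body, since the file has not changed between the two requests.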
Both headers contain a timestamp in the format defined by RFC 7231 as an “HTTP-date” and Python has functions to handle this format. They can be imported from the module email.utils and, despite the module name, are compatible with the HTTP standard. Here’s an example:
from datetime import datetime
from email.utils import parsedate_to_datetime, formatdate
formatdate(datetime.now().timestamp(), usegmt=True)
# 'Sat, 22 May 2021 03:08:49 GMT'
parsedate_to_datetime('Sat, 22 May 2021 03:08:49 GMT')
# datetime.datetime(2021, 5, 22, 3, 8, 49, tzinfo=datetime.timezone.utc)
It is important to pass usegmt=True to formatdate because HTTP dates are always expressed in GMT.
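A quick way to see the difference is to format the same timestamp with and without the flag. The value 0 (the Unix epoch) is used here only as a fixed example input.

```python
from email.utils import formatdate

epoch = 0  # fixed example timestamp: 1970-01-01 00:00:00 UTC

# Default email-style output uses a numeric zone, which is not a valid HTTP-date
default_style = formatdate(epoch)
# With usegmt=True the zone is written as the literal "GMT"
http_style = formatdate(epoch, usegmt=True)

print(default_style)  # Thu, 01 Jan 1970 00:00:00 -0000
print(http_style)     # Thu, 01 Jan 1970 00:00:00 GMT
```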
I can use the file’s modification time (mtime) to store the modification time indicated by the HTTP response. Python has os.path.getmtime to get the modification time and os.utime to change it:
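import os
from datetime import datetime
# times is (access time, modification time), both in seconds since the epoch;
# hello.txt must already exist
os.utime("hello.txt", times=(datetime.now().timestamp(), 1621653735.0))
os.path.getmtime("hello.txt")
# 1621653735.0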
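Putting the two pieces together, the round trip from a Last-Modified value to the file's mtime and back to an If-Modified-Since value can be sketched as follows. The helper names remember_last_modified and if_modified_since are hypothetical, and the header value is just the example date used earlier.

```python
import os
import tempfile
from email.utils import formatdate, parsedate_to_datetime


def remember_last_modified(path, header_value):
    """Store an HTTP-date (e.g. a Last-Modified value) as the file's mtime."""
    mtime = parsedate_to_datetime(header_value).timestamp()
    atime = os.path.getatime(path)  # leave the access time as it is
    os.utime(path, times=(atime, mtime))


def if_modified_since(path):
    """Turn the file's mtime back into an HTTP-date for the next request."""
    return formatdate(os.path.getmtime(path), usegmt=True)


last_modified = "Sat, 22 May 2021 03:08:49 GMT"  # example header value

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "hello.txt")
    with open(path, "w") as f:
        f.write("hello")

    remember_last_modified(path, last_modified)
    round_tripped = if_modified_since(path)

print(round_tripped)
```

HTTP dates have a resolution of one second, so the value survives the round trip unchanged on any filesystem that stores mtime with at least that precision.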
Now, let’s see the actual download function. I’ve used requests to make the HTTP request as follows:
import os
from datetime import datetime
from email.utils import formatdate, parsedate_to_datetime

import requests


def download(url, destination_file):
    headers = {}

    # If a local copy exists, ask the server to send the file only if it is newer
    if os.path.exists(destination_file):
        mtime = os.path.getmtime(destination_file)
        headers["if-modified-since"] = formatdate(mtime, usegmt=True)

    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()

    # 304: the local copy is still current, nothing to do
    if response.status_code == requests.codes.not_modified:
        return

    if response.status_code == requests.codes.ok:
        # Write the body in 1 MiB chunks instead of loading it all into memory
        with open(destination_file, "wb") as f:
            for chunk in response.iter_content(chunk_size=1048576):
                f.write(chunk)

        # Record the server's Last-Modified as the file's mtime for the next call
        if last_modified := response.headers.get("last-modified"):
            new_mtime = parsedate_to_datetime(last_modified).timestamp()
            os.utime(destination_file, times=(datetime.now().timestamp(), new_mtime))
Here are some important parts of this function:
response.raise_for_status() will raise an error for 4xx (client errors) or 5xx (server errors) responses. A 304 response does not raise, which is why the status code is checked explicitly afterwards.
requests.get(..., stream=True) and response.iter_content(chunk_size=1048576) are used to iterate over the response data in chunks and are important to avoid reading the full dataset into memory. chunk_size is the number of bytes per chunk which is, in this case, 1 MiB.
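One thing to keep in mind with chunked writes is that an interrupted download leaves a truncated file with a fresh mtime, so the next conditional request may well get a 304 and keep the broken copy. A possible hardening, which is not part of the function above, is to write the chunks to a temporary file in the same directory and rename it only on success. The helper write_chunks_atomically below is a hypothetical sketch; it accepts any iterable of bytes, such as response.iter_content(...).

```python
import os
import tempfile


def write_chunks_atomically(chunks, destination_file, mtime=None):
    """Write an iterable of bytes to destination_file via a temporary file.

    The destination is replaced only after every chunk has been written,
    so a failure mid-download leaves any previous copy untouched.
    """
    directory = os.path.dirname(os.path.abspath(destination_file))
    # Same directory as the destination, so the final rename stays on one filesystem
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        if mtime is not None:
            # Set (atime, mtime) before the rename; the rename keeps them
            os.utime(temp_path, times=(mtime, mtime))
        os.replace(temp_path, destination_file)
    except BaseException:
        # Clean up the partial file and let the caller see the original error
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

Inside download, this would take the place of the open(...) loop, with the parsed Last-Modified timestamp passed as mtime when the header is present.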
The download function can be used like this:
dataset_url = "https://www.tesourotransparente.gov.br/ckan/dataset/df56aa42-484a-4a59-8184-7676580c81e3/resource/796d2059-14e9-44e3-80c9-2d9e30b405c1/download/PrecoTaxaTesouroDireto.csv"
download(dataset_url, "dataset.csv")
Calling it multiple times will update the dataset only if it has changed on the server, as expected. Depending on the situation, it would be good to also implement a check on the Cache-Control header, but for now this is good enough.
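As a starting point for that Cache-Control check, here is a rough sketch that reads only the max-age directive and decides whether a cached copy is still fresh. The function names parse_max_age and is_fresh are made up for this example, the sketch assumes the caller records when the file was downloaded, and it deliberately ignores other directives such as no-cache and no-store.

```python
import time


def parse_max_age(cache_control):
    """Return max-age in seconds from a Cache-Control value, or None if absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip().strip('"')
        if name.strip().lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def is_fresh(downloaded_at, max_age, now=None):
    """True while fewer than max_age seconds have passed since the download."""
    if max_age is None:
        return False  # no max-age: fall back to the conditional request
    if now is None:
        now = time.time()
    return now - downloaded_at < max_age


max_age = parse_max_age("public, max-age=3600")
print(max_age)                              # 3600
print(is_fresh(1000.0, max_age, now=2000))  # True
print(is_fresh(1000.0, max_age, now=5000))  # False
```

While is_fresh returns True, the download could skip the HTTP request entirely; once it returns False, the If-Modified-Since request takes over again.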