Use git-lfs to Manage Large Files in My Blog

Motivation

I have been a fan of org-mode for more than four years. I use it to record my feelings, write notes and blog posts, and organize my life. Though most of my notes and blog posts are only halfway done, I still find it beneficial to write things down.

Most of my friends use Notion to organize their notes and discuss ideas. I admit that Notion is better for teamwork and probably has a larger user base. However, I prefer a private, informal space (like this blog) where I can write random text whenever I want, and I want complete control of this space (all contents should have offline backups and version control).

Fortunately, org-mode has all of the functionality I want: a powerful markup language with code and formula rendering support, jupyter-like interactive code blocks, a decent to-do list (though I don't always finish things on time), and tools for exporting org files to html/latex so that I can share some of my posts with others. Compared to markdown and static site generators, org-mode is not only a markup language but also an ecosystem tightly integrated with emacs. Honestly speaking, I don't use emacs to write code anymore [1], but it's still highly customizable and convenient for writing articles.

My blog is basically an org-mode repository under git version control. I bought a VPS to serve as a git server, and whenever a commit is pushed, the server runs emacs' org-export function to generate static HTML files. This is easy to implement with a post-receive hook on the git server.

Git was originally designed for version control of source code. However, I sometimes use figures to explain ideas in my blog posts, and I also organize my photos with org-mode. It's well known that git does not handle large binary files well [2]. One alternative is to host images on a standalone service such as Amazon S3, but since I already own a VPS, I feel it unnecessary to rent another service.

git-lfs was proposed to aid git with large file storage. The idea of git-lfs is simple, in their own words:

Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.

Users can mark certain file types as large binary files, and git-lfs then treats them differently from ordinary text files: their contents are stored in a standalone large file storage, while only small text pointers to them are committed to git. A smudge filter rewrites the pointers back to the real contents when we check out files, run git lfs pull, or do something similar.
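To make the "text pointer" idea concrete, here is a minimal sketch that prints the pointer text for a given file, following the git-lfs pointer format (a version line, the SHA-256 of the contents, and the size in bytes). The helper name make_lfs_pointer is my own invention, not part of git-lfs, and the sketch assumes GNU coreutils (sha256sum) is available.

```shell
#!/usr/bin/env bash
# Sketch: print the pointer text that would stand in for a large file in git.
# make_lfs_pointer is a hypothetical helper name, not a git-lfs command.
make_lfs_pointer() {
    local file=$1 oid size
    oid=$(sha256sum "$file" | cut -d ' ' -f 1)   # SHA-256 of the file contents
    size=$(( $(wc -c < "$file") ))                # file size in bytes
    printf 'version https://git-lfs.github.com/spec/v1\n'
    printf 'oid sha256:%s\n' "$oid"
    printf 'size %s\n' "$size"
}

# Usage: make_lfs_pointer photo.jpg
```

In practice git-lfs generates these pointers itself. Running git lfs track "*.jpg" records a line like `*.jpg filter=lfs diff=lfs merge=lfs -text` in .gitattributes, and git lfs pointer --file=photo.jpg should print the same three lines for comparison.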

Install git-lfs on self-hosted server

The git-lfs team maintains an official release of the git-lfs client, but they do not provide an up-to-date official git-lfs server implementation. Instead, they have designed a set of standard APIs so that different git-lfs service providers can implement them in their own ways. GitHub and Atlassian repos have lfs support, but GitHub only provides a 2GB free lfs budget and my blog repository has grown beyond this size. Personally, I also prefer using git-lfs on a self-hosted server instead of GitHub. Here is a list of unofficial implementations of git-lfs server, but most of them are unmaintained or personal projects made just for fun. To be safe, I chose gitea because it has an active open-source community (I suppose gitlab's user experience is similar).

Setting up gitea

The installation experience was smooth: I just followed these instructions and everything worked as expected.

I use acme.sh to get the SSL certificate for my blog. To enable https for gitea, all we need is to add another domain and webroot when issuing the certificate. Note that the webroot of gitea resides in $GITEA_CUSTOM , whose value defaults to $GITEA_WORK_DIR/custom .

Migrating previous repo to gitea

Gitea can import repositories that already exist on the server: move the repo to the user's working directory, open the site administration page on the gitea website, and pick the repo from "unadopted repositories". The repo will then appear in your profile.

Gitea rewrites the hook files with its own, e.g. translating the local path to the absolute path. If you have copied your ssh public key to the authorized keys of the git user on the server, be sure to remove it and add your key to the SSH keys managed by gitea. Otherwise, git requests would bypass gitea and the path would not be translated correctly. [3]

Deploying Blog

The remaining step is to rewrite the post-receive git hook to trigger the blog deployment script on every push. I used to run the command git --work-tree=/var/www/YOURWEBSITE checkout -f to check out files from the bare repository on the git server. However, this no longer worked after I enabled lfs.
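For reference, a post-receive hook receives one line per updated ref on its standard input, in the form "old-sha new-sha ref-name". The sketch below shows one way a hook could deploy only when a particular branch is pushed. This branch filter is an assumption of mine (my own setup deploys on every push), and the helper name pushed_deploy_branch and the branch name master are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a push touched the branch we want to deploy.
# pushed_deploy_branch is a hypothetical helper, not part of git or gitea.
# It reads the "<old-sha> <new-sha> <ref-name>" lines git feeds to post-receive.
pushed_deploy_branch() {
    local deploy_ref="refs/heads/$1" old new ref
    while read -r old new ref; do
        if [ "$ref" = "$deploy_ref" ]; then
            return 0    # the deploy branch was updated by this push
        fi
    done
    return 1            # no matching ref in this push
}

# In a real hook one might write (commented out so the sketch has no side effects):
# if pushed_deploy_branch master; then
#     git --work-tree=/var/www/YOURWEBSITE checkout -f
# fi
```

Keeping the check in a small function makes it easy to try out by piping sample lines into it before wiring it into the hook.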

The first error I encountered was "missing protocol". This is because old versions of git-lfs do not support local directory-to-directory clones, so we must use the https or ssh protocol even to clone a local repository. This sounds silly, doesn't it? Fortunately, later versions of git-lfs (2.10+) support a new protocol called file, which handles local file transfer.

After I upgraded git-lfs to 3.2, another issue came up. git-lfs reported:

Error downloading object: ???.jpg (hash): Smudge error: Error downloading ???.jpg (hash): error transferring "hash": [0] remote missing object hash 

I investigated the git repository on the server side and found that the bare repo's lfs folder was empty: all lfs objects are stored in gitea's lfs path (I am very confident that this is gitea's behavior).

Workaround

I wrote a script that links the lfs objects in gitea's lfs path into the repo's local lfs folder before running git checkout -f. Here is the full deployment script, named deploy:

#!/usr/bin/env bash
# DOMAIN and GITEA_LFS_PATH must be set before this script runs; abort otherwise.
: "${DOMAIN:?DOMAIN is not set}"
: "${GITEA_LFS_PATH:?GITEA_LFS_PATH is not set}"

export COMMIT_ID=$(git rev-parse --short HEAD)
export WORK_TREE=/var/www/${DOMAIN}
export LOG_PATH=/var/log/${DOMAIN}/${COMMIT_ID}
# Use an absolute path so that GIT_DIR stays valid after the cd below.
export GIT_DIR=$(git rev-parse --absolute-git-dir)
mkdir -p "${LOG_PATH}"

# Log files are appended to (>>) so that earlier messages are not overwritten.
echo "link gitea lfs objects: starts"
if [ ! -d "${GIT_DIR}/lfs/objects" ]; then
    echo "${GIT_DIR}/lfs/objects does not exist, create directory..." >> "${LOG_PATH}/LOG"
    mkdir -p "${GIT_DIR}/lfs/objects" 2>> "${LOG_PATH}/ERR"
else
    echo "${GIT_DIR}/lfs/objects already exists" >> "${LOG_PATH}/LOG"
fi

for oid in $(git lfs ls-files -l | cut -c -64); do
    # The local store names a file by its full oid; gitea's store drops the first four characters.
    prefix_0=${oid:0:2}
    prefix_1=${oid:2:2}
    suffix=${oid:4:60}
    obj_dir=${GIT_DIR}/lfs/objects/${prefix_0}/${prefix_1}
    obj_file=${obj_dir}/${oid}
    if [ ! -d "${obj_dir}" ]; then
        echo "${obj_dir} does not exist, create directory..." >> "${LOG_PATH}/LOG"
        mkdir -p "${obj_dir}" 2>> "${LOG_PATH}/ERR"
    else
        echo "${obj_dir} already exists" >> "${LOG_PATH}/LOG"
    fi
    if [ ! -f "${obj_file}" ]; then
        echo "${obj_file} does not exist, link from gitea lfs..." >> "${LOG_PATH}/LOG"
        ln -s "${GITEA_LFS_PATH}/${prefix_0}/${prefix_1}/${suffix}" "${obj_file}" 2>> "${LOG_PATH}/ERR"
    else
        echo "${obj_file} already exists" >> "${LOG_PATH}/LOG"
    fi
done
echo "link gitea lfs objects: finished"

# deploy blog
echo "git checkout files"
git --work-tree="${WORK_TREE}" checkout -f >> "${LOG_PATH}/LOG" 2>> "${LOG_PATH}/ERR"
cd "${WORK_TREE}" || exit 1
echo "make html"
make all >> "${LOG_PATH}/LOG" 2>> "${LOG_PATH}/ERR"

Here GITEA_LFS_PATH refers to gitea's LFS_CONTENT_PATH . The deploy script is placed under hooks/post-receive.d/, and it must be made executable with chmod +rwx deploy .
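The key detail in the script is that the two stores name the same object differently. As a rough sketch of the mapping, based only on the layout I observed on my server, the two hypothetical helpers below (the function names and the example directories are made up for illustration) compute both paths from an object id:

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the path layout used by the deploy script.

# Path of an object inside the bare repository's own lfs store:
# <git-dir>/lfs/objects/<aa>/<bb>/<full oid>
lfs_local_path() {
    local git_dir=$1 oid=$2
    printf '%s/lfs/objects/%s/%s/%s\n' "$git_dir" "${oid:0:2}" "${oid:2:2}" "$oid"
}

# Path of the same object inside gitea's lfs content directory:
# <lfs-root>/<aa>/<bb>/<oid without its first four characters>
gitea_lfs_path() {
    local lfs_root=$1 oid=$2
    printf '%s/%s/%s/%s\n' "$lfs_root" "${oid:0:2}" "${oid:2:2}" "${oid:4}"
}

# Usage (paths are placeholders):
# ln -s "$(gitea_lfs_path /path/to/gitea/lfs "$oid")" "$(lfs_local_path /path/to/repo.git "$oid")"
```

Both functions only build strings, so they can be checked against a few real files on the server before any symlink is created.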

Experience

Upgrading my blog repo to use git-lfs and modifying the deployment process accordingly was not an easy job. Still, it really improves my experience: the waiting time per pull/push is reduced significantly. There is a famous article that encourages people not to use git-lfs. Though I agree with some of its arguments, I will stick to my decision until there is an (apparently) better choice than git-lfs.

Footnotes:

[1] I trust the emacs community, but vscode has better remote development support.

[2] Git compresses git objects (including trees and blobs) with zlib and stores them in packfiles. However, large non-text files such as JPEGs are already compressed, so compressing them again brings little extra benefit, and compressing and decompressing large files is time-consuming.

Author: expye(Zihao YE)

Email: expye@outlook.com

Date: 2022-11-22 Tue 00:00

Last modified: 2024-07-04 Thu 10:35

Licensed under CC BY-NC 4.0