Saving state information between GitLab CI runs

I had an unusual scenario where I had to find out whether certain files (in a specific directory) had changed between GitLab CI job runs. One of my original ideas was to run jobs on changes to certain files using only:changes. This had two problems. First, a pipeline would be triggered on every commit regardless of which files were changed or added (even with only:changes, the job would be initiated but would not run any tasks), and that's a waste of resources. Second, I needed to check for changes periodically (more specifically, every Tuesday and Thursday), not on every commit. My next thought was to have a job update a list of changed files on every commit, and then use that list in my scheduled job runs on Tuesdays and Thursdays. That would mean saving state information somewhere, which GitLab doesn't explicitly provide for.

Enter the GitLab CI cache, which is meant for caching dependencies between runs but can also store arbitrary files. I initially thought I would store a list of changed files in there, but then I realised I could store just the commit ID of the last run and use a Git command to find out which files had changed between that commit and HEAD:

git diff HEAD $(head -1 $LAST_COMMIT_FILE) --name-only | grep $DIR_NAME
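To see what that command returns, here is a small sketch that can be run locally, outside CI. It builds a throwaway repository, records a commit ID in a file the way the cache would hold it, and then lists the files under a directory that changed since that commit. The repository contents, the file names, and the changed_since helper are all made up for illustration; only the git diff and grep combination comes from the command above.

```shell
#!/bin/sh
set -eu

# changed_since FILE DIR: print files matching DIR that differ between the
# commit recorded on the first line of FILE and HEAD.
changed_since() {
    last_commit=$(head -1 "$1")
    # grep exits non-zero when nothing matches; "|| true" keeps set -e quiet
    git diff --name-only "$last_commit" HEAD | grep "$2" || true
}

# Throwaway repository with made-up contents
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "ci@example.com"
git config user.name "CI sketch"

mkdir Important_files other
echo one > Important_files/a.txt
echo one > other/b.txt
git add .
git commit -q -m "first"

# Record the commit of the "last run", as the cached file would
LAST_COMMIT_FILE=.commit_for_last_run
git rev-parse HEAD > "$LAST_COMMIT_FILE"

# A later commit touches only one of the two files
echo two > Important_files/a.txt
git add Important_files
git commit -q -m "second"

CHANGED_FILES=$(changed_since "$LAST_COMMIT_FILE" Important_files)
echo "$CHANGED_FILES"
```

Run as is, this prints Important_files/a.txt, the only file changed since the recorded commit.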

Combining all of that, I came up with this .gitlab-ci.yml:

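image: alpine:latest

variables:
  LAST_COMMIT_FILE: .commit_for_last_run
  TARGET_DIR: Important_files # This can be any string, not just a directory name

# The job name is a placeholder
check_changed_files:
  only:
    - schedules
  cache:
    paths:
      - .commit_for_last_run
  script:
    # alpine:latest does not ship git
    - apk add --no-cache git

    # If the cache did not restore the file, start from the current commit
    - if [ ! -w $LAST_COMMIT_FILE ]; then echo $CI_COMMIT_SHA > $LAST_COMMIT_FILE; fi

    # Get files changed since the last time this script was run
    - export CHANGED_FILES=$(git diff HEAD $(head -1 $LAST_COMMIT_FILE) --name-only | grep $TARGET_DIR)

    # If such files exist, do things
    - if [ ! -z "$CHANGED_FILES" ]; then echo "do things here"; else echo "No changes between $(head -1 $LAST_COMMIT_FILE) and $CI_COMMIT_SHA."; fi

    # Store current commit ID in the last commit storage file
    - echo $CI_COMMIT_SHA > $LAST_COMMIT_FILE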

$CI_COMMIT_SHA is a predefined GitLab CI variable that holds the ID of the commit the job runs for, which I presume is the same commit HEAD points to in the job's checkout. Note that $TARGET_DIR can be any string and not necessarily the name of a directory, since we only grep for it in the list of changed paths.
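Because the match is a plain grep, the string can hit anywhere in a path. The sketch below shows the difference between the unanchored pattern used in the job and an anchored variant that only matches paths inside the directory. The file list is hypothetical and stands in for the output of git diff --name-only; the anchored pattern is a suggestion, not part of the original job.

```shell
#!/bin/sh
# Hypothetical file list, standing in for the output of git diff --name-only
FILES=$(printf '%s\n' \
    'Important_files/a.txt' \
    'docs/Important_files_notes.md' \
    'src/main.c')

# Unanchored: matches the string anywhere in the path
LOOSE=$(printf '%s\n' "$FILES" | grep 'Important_files')

# Anchored: only paths that start with the directory itself
STRICT=$(printf '%s\n' "$FILES" | grep '^Important_files/')

echo "loose:"
echo "$LOOSE"
echo "strict:"
echo "$STRICT"
```

The loose pattern returns two paths here, including the notes file under docs, while the strict one returns only Important_files/a.txt. Which one is preferable depends on whether a loose match is the intent.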

Also note that the cache is provided on a best-effort basis. It's usually stored locally (where the runner resides) unless you have enabled distributed caching with S3 uploads, so this approach is theoretically unreliable, although I have seen no sign of unreliability in more than a month in production. In case the cache isn't retrieved successfully, the job writes the current commit ID into $LAST_COMMIT_FILE whenever that file doesn't exist.
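The fallback can be tried locally with a sketch like the one below. The resolve_last_commit helper, the temporary directory, and the commit IDs are made up for illustration. It uses -s (file exists and is non-empty) rather than the -w test in the job, which is a slightly stricter check and an assumption on my part.

```shell
#!/bin/sh
# resolve_last_commit FILE FALLBACK: print the commit ID stored on the first
# line of FILE, or FALLBACK when FILE is missing or empty (e.g. a cache miss).
resolve_last_commit() {
    if [ -s "$1" ]; then
        head -1 "$1"
    else
        echo "$2"
    fi
}

# Made-up values for a local dry run; in CI the variable comes from the job
CI_COMMIT_SHA=${CI_COMMIT_SHA:-0123abc}
workdir=$(mktemp -d)

# Cache miss: the file is not there, so the current commit is used
MISS=$(resolve_last_commit "$workdir/.commit_for_last_run" "$CI_COMMIT_SHA")

# Cache hit: the stored commit wins
echo "89abcde" > "$workdir/.commit_for_last_run"
HIT=$(resolve_last_commit "$workdir/.commit_for_last_run" "$CI_COMMIT_SHA")

echo "miss=$MISS hit=$HIT"
```

One consequence of this fallback: after a cache miss the diff runs from the current commit to itself, so that run reports no changes, and anything that changed since the real last run goes unnoticed.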