Metadata

Tracking file metadata is not simple. Many solutions have been proposed, but none works in every situation (and neither will Kea). Some existing approaches store metadata:

  • Within a file, for example JPEG images have EXIF data.
  • In the filesystem. All filesystems store some metadata (such as the filename and the location on the filesystem). Some filesystems allow storing more metadata (e.g. xattr, extended attributes).
  • As sidecar files, i.e. each file has an accompanying file that stores metadata.
  • In a database

Kea links metadata (basically, any key-value pair you like) to the sha256 checksum of a file. This approach has the following advantages:

  • It does not interfere with your files; they do not change. None of your software needs to be aware that you are storing metadata.
  • No sidecar files, so you cannot forget to copy, move or rename them.
  • Independent of OS and filesystem. Xattr is nice, but it is not universal and does not, for example, work across NFS.
  • Easy to export: you can dump a file with checksums and metadata that you can process further.
  • Sha256 is a strong, cross-platform algorithm.
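
To make the idea concrete, here is a minimal Python sketch of metadata keyed by checksum rather than by path. It is not Kea's actual code: the ChecksumStore class, its in-memory dict and the "species" key are hypothetical stand-ins for the real database. It only illustrates why a copied or renamed file keeps its metadata.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ChecksumStore:
    """Hypothetical in-memory stand-in for a metadata database."""

    def __init__(self):
        self._metadata = {}  # sha256 hex digest -> {key: value}

    def set(self, path, key, value):
        # The path is only used to find the checksum; it is not the key.
        self._metadata.setdefault(sha256_of(path), {})[key] = value

    def get(self, path):
        return dict(self._metadata.get(sha256_of(path), {}))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "kea.txt"
        original.write_text("Nestor notabilis\n")

        store = ChecksumStore()
        store.set(original, "species", "kea")

        # A copy under another name has the same content, hence the
        # same checksum, hence the same metadata.
        copy = Path(tmp) / "renamed-copy.txt"
        shutil.copy(original, copy)
        return store.get(copy)


if __name__ == "__main__":
    print(demo())
```

The flip side of this design is that editing a file changes its checksum, so the edited file no longer matches the metadata stored for the old content.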

There are downsides, though:

  • It can be slow: calculating the checksum of a large file takes time. Kea does cache checksums and only recalculates them if the modification time, size or path changes.
  • You need to back up and maintain your database.
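
The caching rule mentioned above can be sketched in Python as follows. This is an illustration under assumptions, not Kea's implementation: the ChecksumCache class and its recalculations counter are hypothetical, and the cache lives in memory rather than in a database. A checksum is reused only while path, modification time and size are all unchanged.

```python
import hashlib
import os
import tempfile


class ChecksumCache:
    """Hypothetical cache: absolute path -> (mtime_ns, size, sha256)."""

    def __init__(self):
        self._entries = {}
        self.recalculations = 0  # counts how often a file was really hashed

    def checksum(self, path):
        path = os.path.abspath(path)
        stat = os.stat(path)
        fingerprint = (stat.st_mtime_ns, stat.st_size)

        cached = self._entries.get(path)  # a new path is always a miss
        if cached is not None and cached[:2] == fingerprint:
            return cached[2]

        # Cache miss: hash the file contents in chunks.
        self.recalculations += 1
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        self._entries[path] = fingerprint + (digest.hexdigest(),)
        return digest.hexdigest()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kea.txt")
        with open(path, "w") as handle:
            handle.write("Nestor notabilis\n")

        cache = ChecksumCache()
        first = cache.checksum(path)
        second = cache.checksum(path)  # served from the cache

        with open(path, "a") as handle:
            handle.write("more\n")  # size changes, so the entry is stale
        third = cache.checksum(path)

        return first == second, first != third, cache.recalculations


if __name__ == "__main__":
    print(demo())
```

A cache of this kind trades accuracy for speed: a change that preserves both size and modification time would go unnoticed.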

Using Kea to store metadata

Kea can be operated as a command line tool called k3 (it is the third iteration of this tool).

For this part of the documentation we will work with one demo file. To create a file to work on:

$ echo "Nestor notabilis" > kea.txt

For later reference, let’s check the sha256 checksum:

$ sha256sum kea.txt
25802ac7e79b7134ca8968790909264e9d7a23a6a0a4ed6c00caff7c818b83de  kea.txt

If you want to associate metadata with a file, you can use k3:

$ k3 set test any_value file.txt

You can then check if it worked by running:

$ k3 show file.txt
sha256     : a23e5fdcd7b276bdd81aa1a0b7b963101863dd3f61ff57935f8c5ba462681ea6
sha1       : a5c341bec5c89ed16758435069e3124b3685ad93
short      : foj5f3Neyd
md5        : 4d93d51945b88325c213640ef59fc50b
# Files (mtime, hostname, path)
2018-08-31T15:16:50.132572 gbw-d-l0067 /home/luna.kuleuven.be/u0089478/project/kea3/docs/file.txt **
# Metadata:
 - size       : 10
 - test       : any_value

There are a few things to unpack here:

  • First of all, under the header # Metadata you see the field we just set. That’s good.
  • Secondly, there are a number of checksums at the top. Besides sha256, Kea also stores the sha1 and md5 checksums (mostly for cross-referencing with an older system using sha). There is one extra checksum, short, which always starts with an f followed by the first 9 characters of a base64 conversion of the sha256 checksum. The short checksum is used only for convenience within Kea: short ids are easier to copy and paste than the long sha256 ones.
  • Given that Kea links metadata to a checksum, not a file, Kea essentially shows data associated with a checksum, not a file. For the filename given on the command line (file.txt), the checksum is retrieved and the database is queried. Kea then also outputs all files it knows of with that checksum; they are shown under # Files.
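
The short checksum described above can be reproduced with a few lines of Python. This is a sketch based on the description and the example output, not Kea's own code: the function name short_id is hypothetical, and using the standard base64 alphabet (rather than the URL-safe one) is an assumption. Encoding the raw digest bytes, not the hex string, is what matches the example.

```python
import base64


def short_id(sha256_hex):
    """Return 'f' plus the first 9 base64 characters of the sha256 digest."""
    raw = bytes.fromhex(sha256_hex)  # the 32 raw digest bytes
    encoded = base64.b64encode(raw).decode("ascii")
    return "f" + encoded[:9]


if __name__ == "__main__":
    sha256 = "a23e5fdcd7b276bdd81aa1a0b7b963101863dd3f61ff57935f8c5ba462681ea6"
    print(short_id(sha256))  # prints foj5f3Neyd
```

Nine base64 characters carry 54 bits of the digest, which is short enough to copy by hand, although collisions are far more likely than with the full sha256.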

Metadata fields

Multiple copies

Given that Kea stores data linked to