source: project/release/4/ugarit/trunk/README.txt @ 25501

Last change on this file since 25501 was 25501, checked in by Alaric Snell-Pym, 9 years ago

ugarit: Significant README improvements, and enabled consistency check of read blocks by default, and removed warning about deletions from backend-cache.

# Introduction

Ugarit is a backup/archival system based around content-addressable storage.

This allows it to upload incremental backups to a remote server or a
local filesystem such as an NFS share or a removable hard disk, yet
have the archive instantly able to produce a full snapshot on demand
rather than needing to download a full snapshot plus all the
incrementals since. The content-addressable storage technique means
that the incrementals can be applied to a snapshot on various kinds of
storage without needing intelligence in the storage itself - so the
snapshots can live within Amazon S3 or on a removable hard disk.

Also, the same storage can be shared between multiple systems that all
back up to it - and the incremental upload algorithm will mean that
any files shared between the servers will only need to be uploaded
once. If you back up a complete server, then go and back up another
that is running the same distribution, then all the files in `/bin`
and so on that are already in the storage will not need to be backed
up again; the system will automatically spot that they're already
there, and not upload them again.

## So what's that mean in practice?

You can run Ugarit to back up any number of filesystems to a shared
archive, and on every backup, Ugarit will only upload files or parts
of files that aren't already in the archive - be they from the
previous snapshot, earlier snapshots, snapshots of entirely unrelated
filesystems, etc. Every time you do a snapshot, Ugarit builds an
entire complete directory tree of the snapshot in the archive - but
reusing any parts of files, files, or entire directories that already
exist anywhere in the archive, and only uploading what doesn't already
exist.

The support for parts of files means that, in many cases, gigantic
files like database tables and virtual disks for virtual machines will
not need to be uploaded entirely every time they change, as the
changed sections will be identified and uploaded.

Because a complete directory tree exists in the archive for any
snapshot, the extraction algorithm is incredibly simple - and,
therefore, incredibly reliable and fast. Simple, reliable, and fast
are just what you need when you're trying to reconstruct the
filesystem of a live server.

Also, it means that you can do lots of small snapshots. If you run a
snapshot every hour, then only a megabyte or two might have changed in
your filesystem, so you only upload a megabyte or two - yet you end up
with a complete history of your filesystem at hourly intervals in the
archive.

Conventional backup systems usually either store a full backup then
incrementals to their archives, meaning that doing a restore involves
reading the full backup then reading every incremental since and
applying them - so to do a restore, you have to download *every
version* of the filesystem you've ever uploaded, or you have to do
periodic full backups (even though most of your filesystem won't have
changed since the last full backup) to reduce the number of
incrementals required for a restore. Better results are had from
systems that use a special backup server to look after the archive
storage, which accept incremental backups and apply them to the
snapshot they keep in order to maintain a most-recent snapshot that
can be downloaded in a single run; but they then restrict you to using
dedicated servers as your archive stores, ruling out cheaply scalable
solutions like Amazon S3, or just backing up to a removable USB or
eSATA disk you attach to your system whenever you do a backup. And
dedicated backup servers are complex pieces of software; can you rely
on something complex for the fundamental foundation of your data
security system?

## System Requirements

Ugarit should run on any POSIX-compliant system that can run [Chicken
Scheme]( It stores and
restores all the file attributes reported by the `stat` system call -
POSIX mode permissions, UID, GID, mtime, and optionally atime and
ctime (although the ctime cannot be restored due to POSIX
restrictions). Ugarit will store files, directories, device and
character special files, symlinks, and FIFOs.

Support for extended filesystem attributes - ACLs, alternative
streams, forks and other metadata - is possible, due to the extensible
directory entry format; support for such metadata will be added as
required.

Currently, only local filesystem-based archive storage backends are
complete: these are suitable for backing up to a removable hard disk
or a filesystem shared via NFS or other protocols. However, the
backend can be accessed via an SSH tunnel, so a remote server you are
able to install Ugarit on to run the backends can be used as a remote
archive.

However, the next backend to be implemented will be one for Amazon S3,
and an SFTP backend for storing archives anywhere you can ssh
to. Other backends will be implemented on demand; an archive can, in
principle, be stored on anything that can store files by name, report
on whether a file already exists, and efficiently download a file by
name. This rules out magnetic tapes due to their requirement for
sequential access.

Although we need to trust that a backend won't lose data (for now), we
don't need to trust the backend not to snoop on us, as Ugarit
optionally encrypts everything sent to the archive.

## Terminology

A Ugarit backend is the software module that handles backend
storage. An archive is an actual storage system storing actual data,
accessed through the appropriate backend for that archive. The backend
may run locally under Ugarit itself, or via an SSH tunnel, on a remote
server where it is installed.

For example, if you use the recommended "splitlog" filesystem backend,
your archive might be `/mnt/bigdisk` on the server `prometheus`. The
backend (which is compiled along with the other filesystem backends in
the `backend-fs` binary) must be installed on `prometheus`, and Ugarit
clients all over the place may then use it via ssh to
`prometheus`. However, even with the filesystem backends, the actual
storage might not be on `prometheus` where the backend runs -
`/mnt/bigdisk` might be an NFS mount, or a mount from a storage-area
network. This ability to delegate via SSH is particularly useful with
the "cache" backend, which reduces latency by storing a cache of what
blocks exist in a backend, thereby making it quicker to identify
already-stored files; a cluster of servers all sharing the same
archive might all use SSH tunnels to access an instance of the "cache"
backend on one of them (using some local disk to store the cache),
which proxies the actual archive storage to an archive on the other
end of a high-latency Internet link, again via an SSH tunnel.

## What's in an archive?

A Ugarit archive contains a load of blocks, each up to a maximum size
(usually 1MiB, although other backends might impose smaller
limits). Each block is identified by the hash of its contents; this is
how Ugarit avoids ever uploading the same data twice, by checking to
see if the data to be uploaded already exists in the archive by
looking up the hash. The contents of the blocks are compressed and
then encrypted before upload.

Every file uploaded is, unless it's small enough to fit in a single
block, chopped into blocks, and each block uploaded. This way, the
entire contents of your filesystem can be uploaded - or, at least,
only the parts of it that aren't already there! The blocks are then
tied together to create a snapshot by uploading blocks full of the
hashes of the data blocks, and directory blocks are uploaded listing
the names and attributes of files in directories, along with the
hashes of the blocks that contain the files' contents. Even the blocks
that contain lists of hashes of other blocks are subject to checking
for pre-existence in the archive; if only a few MiB of your
hundred-GiB filesystem has changed, then even the index blocks and
directory blocks are re-used from previous snapshots.

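The layering of data blocks and index blocks can be sketched like so (again illustrative Python with hypothetical helper names; a real archive also compresses and encrypts each block, and directory blocks carry names and attributes too):

```python
import hashlib

BLOCK_SIZE = 4  # tiny, for illustration; Ugarit's blocks are up to ~1MiB

def put_block(store, data):
    key = hashlib.sha256(data).hexdigest()
    store.setdefault(key, data)   # skip the upload if already present
    return key

def put_file(store, contents):
    """Chop a file into blocks, upload each, then upload an index
    block that is simply the list of the data blocks' hashes."""
    hashes = [put_block(store, contents[i:i + BLOCK_SIZE])
              for i in range(0, len(contents), BLOCK_SIZE)]
    index = "\n".join(hashes).encode()
    return put_block(store, index)  # the index block is deduplicated too

store = {}
h1 = put_file(store, b"aaaabbbbcccc")
h2 = put_file(store, b"aaaabbbbcccc")  # identical file: nothing new stored
assert h1 == h2
```

Because the index block's own hash is derived from its contents, two identical files yield the same index hash, and an unchanged file costs nothing on the next snapshot.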
Once uploaded, a block in the archive is never again changed. After
all, if its contents changed, its hash would change, so it would no
longer be the same block! However, every block has a reference count,
tracking the number of index blocks that refer to it. This means that
the archive knows which blocks are shared between multiple snapshots
(or shared *within* a snapshot - if a filesystem has more than one
copy of the same file, still only one copy is uploaded), so that if a
given snapshot is deleted, then the blocks that only that snapshot is
using can be deleted to free up space, without corrupting other
snapshots by deleting blocks they share. Keep in mind, however, that
not all storage backends may support this - there are certain
advantages to being an append-only archive. For a start, you can't
delete something by accident! The supplied fs backend supports
deletion, while the splitlog backend does not yet. However, the actual
snapshot deletion command hasn't been implemented yet either, so it's
a moot point for now...

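Reference counting might be sketched like this (illustrative only; the real bookkeeping lives inside each backend, and as noted, not all backends support deletion):

```python
def add_ref(refs, key):
    """Record that one more index block refers to this block."""
    refs[key] = refs.get(key, 0) + 1

def release(store, refs, key):
    """Drop one reference; delete the block only when nobody else
    refers to it, so blocks shared between snapshots survive."""
    refs[key] -= 1
    if refs[key] == 0:
        del store[key]
        del refs[key]

store = {"h1": b"block data"}
refs = {}
add_ref(refs, "h1")   # snapshot A uses the block
add_ref(refs, "h1")   # snapshot B shares it
release(store, refs, "h1")   # delete snapshot A
assert "h1" in store         # still referenced by snapshot B
release(store, refs, "h1")   # delete snapshot B
assert "h1" not in store     # now it can be reclaimed
```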
Finally, the archive contains objects called tags. Unlike the blocks,
the tags' contents can change, and they have meaningful names rather
than being identified by hash. Tags identify the top-level blocks of
snapshots within the system, from which (by following the chain of
hashes down through the index blocks) the entire contents of a
snapshot may be found. Unless you happen to have recorded the hash of
a snapshot somewhere, the tags are where you find snapshots from when
you want to do a restore!

Whenever a snapshot is taken, as soon as Ugarit has uploaded all the
files, directories, and index blocks required, it looks up the tag you
have identified as the target of the snapshot. If the tag already
exists, then the snapshot it currently points to is recorded in the
new snapshot as the "previous snapshot"; a snapshot header
containing the previous snapshot hash, along with the date and time
and any comments you provide for the snapshot, is then uploaded (as
another block, identified by its hash). The tag is then updated to
point to the new snapshot.

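That tag-update step could be sketched as follows (hypothetical structure; a real snapshot header also records the root directory hash and is stored in Ugarit's own block format, not JSON):

```python
import hashlib, json, time

def take_snapshot(store, tags, tag, root_hash, comment=""):
    """Upload a snapshot header pointing at the previous snapshot
    (if the tag exists), then move the tag to the new header."""
    header = json.dumps({
        "root": root_hash,
        "previous": tags.get(tag),   # hash of the prior snapshot, or None
        "when": time.time(),
        "comment": comment,
    }).encode()
    key = hashlib.sha256(header).hexdigest()
    store[key] = header
    tags[tag] = key                  # tags are the only mutable objects
    return key

store, tags = {}, {}
s1 = take_snapshot(store, tags, "gandalf-home", "roothash1")
s2 = take_snapshot(store, tags, "gandalf-home", "roothash2")
assert json.loads(store[s2])["previous"] == s1
assert tags["gandalf-home"] == s2
```

Following the `previous` pointers from the tag recovers the whole chronological chain of snapshots.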
This way, each tag actually identifies a chronological chain of
snapshots. Normally, you would use a tag to identify a filesystem
being backed up; you'd keep snapshotting the filesystem to the same
tag, resulting in all the snapshots of that filesystem hanging from
the tag. But if you wanted to remember any particular snapshot
(perhaps if it's the snapshot you take before a big upgrade or other
risky operation), you can duplicate the tag, in effect 'forking' the
chain of snapshots much like a branch in a version control system.

# Using Ugarit

## Installation

Install [Chicken Scheme]( using their [installation instructions](

Ugarit can then be installed by typing (as root):

    chicken-install ugarit

See the [chicken-install manual]( for details if you have any trouble, or wish to install into your home directory.

## Setting up an archive

Firstly, you need to know the archive identifier for the place you'll
be storing your archives. This depends on your backend. The archive
identifier is actually the command line used to invoke the backend for
a particular archive; communication with the archive is via standard
input and output, which is why it's easy to tunnel via ssh.

218### Local filesystem backends
220These backends use the local filesystem to store the archives. Of
221course, the "local filesystem" on a given server might be an NFS mount
222or mounted from a storage-area network.
#### Logfile backend

The logfile backend works much like the original Venti system. It's
append-only - you won't be able to delete old snapshots from a logfile
archive, even when I implement deletion. It stores the archive in two
sets of files; one is a log of data blocks, split at a specified
maximum size, and the other is the metadata: an sqlite database used
to track the location of blocks in the log files, the contents of
tags, and a count of the logs so a filename can be chosen for a new one.

To set up a new logfile archive, just choose where to put the two
parts. It would be nice to put the metadata file on a different
physical disk to the logs directory, to reduce seeking. If you only
have one disk, you can put the metadata file in the log directory
("metadata" is a good name).

You can then refer to it using the following archive identifier:

      "backend-fs splitlog ...log directory... ...metadata file... max-logfile-size"

For most platforms, a max-logfile-size of 900000000 (900 MB) should
suffice. For now, don't go much bigger than that on 32-bit systems
until Chicken's `file-position` function is fixed to work with files
>1GB in size.

#### Filesystem backend

The filesystem backend creates archives by storing each block or tag
in its own file, in a directory. To keep the objects-per-directory
count down, it'll split the files into subdirectories. Because of
this, it uses a stupendous number of inodes (more than the filesystem
being backed up). Only use it if you don't mind that; splitlog is much
more efficient.

To set up a new filesystem-backend archive, just create an empty
directory that Ugarit will have write access to when it runs. It will
probably run as root in order to be able to access the contents of
files that aren't world-readable (although that's up to you), so be
careful of NFS mounts that have `maproot=nobody` set!

You can then refer to it using the following archive identifier:

      "backend-fs fs ...path to directory..."

### Proxying backends

These backends wrap another archive identifier which the actual
storage task is delegated to, but add some value along the way.

#### SSH tunnelling

It's easy to access an archive stored on a remote server. The caveat
is that the backend then needs to be installed on the remote server!
Since archives are accessed by running the supplied command, and then
talking to them via stdin and stdout, the archive identifier needs
only be:

      "ssh ...hostname... '...remote archive identifier...'"

#### Cache backend

The cache backend is used to cache a list of what blocks exist in the
proxied backend, so that it can answer queries as to the existence of
a block rapidly, even when the proxied backend is on the end of a
high-latency link (eg, the Internet). This should speed up snapshots,
as existing files are identified by asking the backend if the archive
already has them.

The cache backend works by storing the cache in a local sqlite
file. Given a place for it to store that file, usage is simple:

      "backend-cache ...path to cachefile... '...proxied archive identifier...'"

The cache file will be automatically created if it doesn't already
exist, so make sure there's write access to the containing directory.

If you use a cache on an archive shared between servers, make sure
that you either:

 * Never delete things from the archive

 * Make sure all access to the archive is via the same cache

If a block is deleted from an archive, and a cache on that archive is
not aware of the deletion (as it did not go "through" the caching
proxy), then the cache will record that the block exists in the
archive when it does not. This will mean that if a snapshot is made
through the cache that would use that block, then it will be assumed
that the block already exists in the archive when it does
not. Therefore, the block will not be uploaded, and a dangling
reference will result!

Some setups which *are* safe:

 * A single server using an archive via a cache, not sharing it with
   anyone else.

 * A pool of servers using an archive via the same cache.

 * A pool of servers using an archive via one or more caches, and
   maybe some not via the cache, where nothing is ever deleted from
   the archive.

 * A pool of servers using an archive via one cache, and maybe some
   not via the cache, where deletions are only performed on servers
   using the cache, so the cache is always aware.

## Writing a ugarit.conf

`ugarit.conf` should look something like this:

      (storage <archive identifier>)
      (hash tiger "<A secret string>")
      [double-check]
      [(compression [deflate|lzma])]
      [(encryption aes <key>)]
      [(file-cache "<path>")]
      [(rule ...)]

The hash line chooses a hash algorithm. Currently Tiger-192 (`tiger`),
SHA-256 (`sha256`), SHA-384 (`sha384`) and SHA-512 (`sha512`) are
supported; if you omit the line then Tiger will still be used, but it
will be a simple hash of the block with the block type appended, which
reveals to attackers what blocks you have (as the hash is of the
unencrypted block, and the hash is not encrypted). This is useful for
development and testing or for use with trusted archives, but not
advised for use with archives that attackers may snoop at. Providing a
secret string produces a hash function that hashes the block, the type
of block, and the secret string, producing hashes that attackers who
can snoop the archive cannot use to find known blocks. Whichever hash
function you use, you will need to install the required Chicken egg
with one of the following commands:

    chicken-install -s tiger-hash  # for tiger
    chicken-install -s sha2        # for the SHA hashes

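The effect of the secret string amounts to something like this (a sketch using SHA-256 from Python's standard library rather than Tiger; exactly how Ugarit combines block, type, and salt is an assumption here):

```python
import hashlib

def block_key(contents, block_type, salt=b""):
    """Hash the block contents together with its type and an optional
    secret salt; without the salt, anyone can compute the same hash."""
    return hashlib.sha256(contents + block_type + salt).hexdigest()

public = block_key(b"subversive text", b"data")
keyed = block_key(b"subversive text", b"data", salt=b"my secret string")
assert public != keyed  # a snooper can't recognise known content
```

With a secret salt, recognising a known file in the archive requires knowing the salt, not just the file.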
`double-check`, if present, causes Ugarit to perform extra internal
consistency checks during backups, which will detect bugs but may slow
things down.

`lzma` is the recommended compression option for low-bandwidth
backends or when space is tight, but it's very slow to compress;
deflate or no compression at all are better for fast local
archives. To have no compression at all, just remove the `(compression
...)` line entirely. Likewise, to use compression, you need to install
a Chicken egg:

       chicken-install -s z3       # for deflate
       chicken-install -s lzma     # for lzma

Likewise, the `(encryption ...)` line may be omitted to have no
encryption; the only currently supported algorithm is aes (in CBC
mode) with a key given in hex, as a passphrase (hashed to get a key),
or a passphrase read from the terminal on every run. The key may be
16, 24, or 32 bytes for 128-bit, 192-bit or 256-bit AES. To specify a
hex key, just supply it as a string, like so:

      (encryption aes "00112233445566778899AABBCCDDEEFF")

...for 128-bit AES,

      (encryption aes "00112233445566778899AABBCCDDEEFF0011223344556677")

...for 192-bit AES, or

      (encryption aes "00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF")

...for 256-bit AES.

Alternatively, you can provide a passphrase, and specify how large a
key you want it turned into, like so:

      (encryption aes ([16|24|32] "We three kings of Orient are, one in a taxi one in a car, one on a scooter honking his hooter and smoking a fat cigar. Oh, star of wonder, star of light; star with royal dynamite"))

Finally, the extra-paranoid can request that Ugarit prompt for a
passphrase on every run and hash it into a key of the specified
length, like so:

      (encryption aes ([16|24|32] prompt))

(note the lack of quotes around `prompt`, distinguishing it from a passphrase)

Again, as it is an optional feature, to use encryption, you must
install the appropriate Chicken egg:

       chicken-install -s aes

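Deriving a fixed-size key from a passphrase is essentially hashing it down to the requested length (a sketch; SHA-256 is used here for illustration and may not be Ugarit's actual key-derivation scheme):

```python
import hashlib

def passphrase_to_key(passphrase, key_bytes):
    """Hash a passphrase down to a 16-, 24-, or 32-byte AES key."""
    assert key_bytes in (16, 24, 32), "AES keys are 128, 192, or 256 bits"
    return hashlib.sha256(passphrase.encode()).digest()[:key_bytes]

key = passphrase_to_key("We three kings of Orient are...", 32)
assert len(key) == 32
```

The `prompt` option works the same way, except the passphrase is read from the terminal each run instead of living in the configuration file.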
A file cache, if enabled, significantly speeds up subsequent snapshots
of a filesystem tree. The file cache is a file (which Ugarit will
create if it doesn't already exist) mapping filenames to
(mtime,size,hash) tuples; as it scans the filesystem, if it finds a
file in the cache and the mtime and size have not changed, it will
assume it is already archived under the specified hash. This saves it
from having to read the entire file to hash it and then check if the
hash is present in the archive. In other words, if only a few files
have changed since the last snapshot, then snapshotting a directory
tree becomes an O(N) operation, where N is the number of files, rather
than an O(M) operation, where M is the total size of files involved.

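The cache check boils down to the following (an illustrative sketch with hypothetical helper names; the real cache lives in a file rather than an in-memory dictionary):

```python
import hashlib, os

def hash_file(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def cached_hash(cache, path):
    """Return the file's hash, reading the file's contents only if its
    mtime or size has changed since it was last cached."""
    st = os.stat(path)
    entry = cache.get(path)
    if entry and entry[0] == st.st_mtime and entry[1] == st.st_size:
        return entry[2]                        # cache hit: no file read
    h = hash_file(path)                        # cache miss: hash it
    cache[path] = (st.st_mtime, st.st_size, h)
    return h
```

An unchanged file thus costs one `stat` call instead of a full read, which is where the O(N)-versus-O(M) saving comes from.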
For example:

      (storage "ssh ugarit@spiderman 'backend-fs splitlog /mnt/ugarit-data /mnt/ugarit-metadata/metadata 900000000'")
      (hash tiger "Giung0ahKahsh9ahphu5EiGhAhth4eeyDahs2aiWAlohr6raYeequ8uiUr3Oojoh")
      (encryption aes (32 "deing2Aechediequohdo6Thuvu0OLoh6fohngio9koush9euX6el9iesh6Aef4augh3WiY7phahmesh2Theeziniem5hushai5zigushohnah1quae1ooXo0eingu1Aifeo1eeSheaz9ieSie9tieneibeiPho0quu6um8weiyagh4kaeshooThooNgeyoul2Ahsahgh8imohw3hoyazai9gaph5ohhaechiedeenusaeghahghipe8ii3oo9choh5cieth5iev3jiedohquai4Thiedah5sah5kohcepheixai3aiPainozooc6zohNeiy6Jeigeesie5eithoo0ciiNae8Nee3eiSuKaiza0VaiPai2eeFooNgeengaif9yaiv9rathuoQuohy0ohth6OiL9aisaetheeWoh9aiQu0yoo6aequ3quoiChi7joonohwuvaipeuh2eiPoogh1Ie8tiequesoshaeBue5ieca8eerah0quieJoNoh3Jiesh1chei8weidixeen1yah1ioChie0xaimahWeeriex5eetiichahP9iey5ux7ahGhei7eejahxooch5eiqu0Pheir9Reiri4ahqueijuchae8eeyieMeixa4ciisioloe9oaroof1eegh4idaeNg5aepeip8mah7ixaiSohtoxaiH4oe5eeGoh4eemu7mee8ietaecu6Zoodoo0hoP5uquaish2ahc7nooshi0Aidae2Zee4pheeZee3taerae6Aepu2Ayaith2iivohp8Wuikohvae2Peange6zeihep8eC9mee8johshaech1Ubohd4Ko5caequaezaigohyai1TheeN6Gohva6jinguev4oox2eet5auv0aiyeo7eJieGheebaeMahshifaeDohy8quut4ueFei3eiCheimoechoo2EegiveeDah1sohs7ezee3oaWa2iiv2Chi1haiS5ahph4phu5su0hiocee3ooyaeghang7sho7maiXeo5aex"))
      (compression lzma)
      (file-cache "/var/ugarit/cache")

Be careful to put a set of parentheses around each configuration
entry. White space isn't significant, so feel free to indent things
and wrap them over lines if you want.

Keep copies of this file safe - you'll need it to do extractions!
Print a copy out and lock it in your fire safe! Ok, currently, you
might be able to recreate it if you remember where you put the
storage, but encryption keys are harder to remember.

## Your first backup

Think of a tag to identify the filesystem you're backing up. If it's
`/home` on the server `gandalf`, you might call it `gandalf-home`. If
it's the entire filesystem of the server `bilbo`, you might just call
it `bilbo`.

Then from your shell, run (as root):

      # ugarit snapshot <ugarit.conf> [-c] [-a] <tag> <path to root of filesystem>

For example, if we have a `ugarit.conf` in the current directory:

      # ugarit snapshot ugarit.conf -c localhost-etc /etc

Specify the `-c` flag if you want to store ctimes in the archive;
since it's impossible to restore ctimes when extracting from an
archive, doing this is useful only for informational purposes, so it's
not done by default. Similarly, atimes aren't stored in the archive
unless you specify `-a`, because otherwise, there will be a lot of
directory blocks uploaded on every snapshot, as the atime of every
file will have been changed by the previous snapshot - so with `-a`
specified, on every snapshot, every directory in your filesystem will
be uploaded! Ugarit will happily restore atimes if they are found in
an archive; their storage is made optional simply because uploading
them is costly and rarely useful.

## Exploring the archive

Now you have a backup, you can explore the contents of the
archive. This need not be done as root, as long as you can read
`ugarit.conf`; however, if you want to extract files, run it as root
so the uids and gids can be set.

      $ ugarit explore <ugarit.conf>

This will put you into an interactive shell exploring a virtual
filesystem. The root directory contains an entry for every tag; if you
type `ls` you should see your tag listed, and within that tag, you'll
find a list of snapshots, in descending date order, with a special
entry `current` for the most recent snapshot. Within a snapshot,
you'll find the root directory of your snapshot, and will be able to
`cd` into subdirectories, and so on:

      > ls
      Test <tag>
      > cd Test
      /Test> ls
      2009-01-24 10:28:16 <snapshot>
      2009-01-24 10:28:16 <snapshot>
      current <snapshot>
      /Test> cd current
      /Test/current> ls
      README.txt <file>
      LICENCE.txt <symlink>
      subdir <dir>
      .svn <dir>
      FIFO <fifo>
      chardev <character-device>
      blockdev <block-device>
      /Test/current> ls -ll LICENCE.txt
      lrwxr-xr-x 1000 100 2009-01-15 03:02:49 LICENCE.txt -> subdir/LICENCE.txt
      target: subdir/LICENCE.txt
      ctime: 1231988569.0

As well as exploring around, you can also extract files or directories
(or entire snapshots) by using the `get` command. Ugarit will do its
best to restore the metadata of files, subject to the rights of the
user you run it as.

Type `help` to get help in the interactive shell.

## Duplicating tags

As mentioned above, you can duplicate a tag, creating two tags that
refer to the same snapshot and its history but that can then have
their own subsequent history of snapshots applied to each
independently, with the following command:

      $ ugarit fork <ugarit.conf> <existing tag> <new tag>

## `.ugarit` files

By default, Ugarit will archive everything it finds in the filesystem
tree you tell it to snapshot. However, this might not always be
desired; so we provide the facility to override this with `.ugarit`
files, or global rules in your `.conf` file.

Note: The syntax of these files is provisional, as I want to
experiment with usability, as the current syntax is ugly. So please
don't be surprised if the format changes in incompatible ways in
subsequent versions!

In quick summary, if you want to ignore all files or directories
matching a glob in the current directory and below, put the following
in a `.ugarit` file in that directory:

      (* (glob "*~") exclude)

You can write quite complex expressions as well as just globs. The
full set of rules is:

* `(glob "`*pattern*`")` matches files and directories whose names
  match the glob pattern

* `(name "`*name*`")` matches files and directories with exactly that
  name (useful for files called `*`...)

* `(modified-within ` *number* ` seconds)` matches files and
  directories modified within the given number of seconds

* `(modified-within ` *number* ` minutes)` matches files and
  directories modified within the given number of minutes

* `(modified-within ` *number* ` hours)` matches files and directories
  modified within the given number of hours

* `(modified-within ` *number* ` days)` matches files and directories
  modified within the given number of days

* `(not ` *rule*`)` matches files and directories that do not match
  the given rule

* `(and ` *rule* *rule...*`)` matches files and directories that match
  all the given rules

* `(or ` *rule* *rule...*`)` matches files and directories that match
  any of the given rules

Also, you can override a previous exclusion with an explicit include
in a lower-level directory:

    (* (glob "*~") include)

You can bind rules to specific directories, rather than to "this
directory and all beneath it", by specifying an absolute or relative
path instead of the `*`:

    ("/etc" (name "passwd") exclude)

If you use a relative path, it's taken relative to the directory of
the `.ugarit` file.

You can also put some rules in your `.conf` file, although relative
paths are illegal there, by adding lines of this form to the file:

    (rule * (glob "*~") exclude)

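A toy evaluator for these rule expressions might look like this (a sketch that assumes the rules have been parsed into nested tuples; Ugarit itself reads them as Scheme s-expressions):

```python
import fnmatch, time

def matches(rule, name, mtime, now=None):
    """Evaluate a parsed rule like ("and", ("glob", "*~"), ...)
    against a file's name and modification time."""
    now = now or time.time()
    op = rule[0]
    if op == "glob":
        return fnmatch.fnmatch(name, rule[1])
    if op == "name":
        return name == rule[1]
    if op == "modified-within":
        units = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}
        return (now - mtime) <= rule[1] * units[rule[2]]
    if op == "not":
        return not matches(rule[1], name, mtime, now)
    if op == "and":
        return all(matches(r, name, mtime, now) for r in rule[1:])
    if op == "or":
        return any(matches(r, name, mtime, now) for r in rule[1:])
    raise ValueError("unknown rule: %r" % (op,))

assert matches(("glob", "*~"), "notes.txt~", 0)
assert matches(("and", ("glob", "*.txt"), ("not", ("name", "passwd"))),
               "notes.txt", 0)
```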
# Questions and Answers

## What happens if a snapshot is interrupted?

Nothing! Whatever blocks have been uploaded will be uploaded, but the
snapshot is only added to the tag once the entire filesystem has been
snapshotted. So just start the snapshot again. Any files that have
already been uploaded will then not need to be uploaded again, so the
second snapshot should proceed quickly to the point where it failed
before, and continue from there.

Unless the archive ends up with a partially-uploaded corrupted block
due to being interrupted during upload, you'll be fine. The filesystem
backend has been written to avoid this by writing the block to a file
with the wrong name, then renaming it to the correct name when it's
entirely uploaded.

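The write-then-rename trick is the standard POSIX idiom (a sketch of the general technique; the temporary-name convention shown is an assumption, not the backend's actual one):

```python
import os

def write_block_atomically(directory, name, data):
    """Write to a temporary name first, then rename: readers either
    see no file at all or the complete, fully-written block."""
    tmp = os.path.join(directory, name + ".tmp")
    final = os.path.join(directory, name)
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes reach the disk
    os.rename(tmp, final)      # atomic on POSIX filesystems
```

If the process dies mid-write, only the temporary file is left behind, and the block is simply re-uploaded on the next run.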
## Should I share a single large archive between all my filesystems?

I think so. Using a single large archive means that blocks shared
between servers - eg, software installed from packages and that sort
of thing - will only ever need to be uploaded once, saving storage
space and upload bandwidth.

# Security model

I have designed and implemented Ugarit to be able to handle cases
where the actual archive storage is not entirely trusted.

However, security involves tradeoffs, and Ugarit is configurable in
ways that affect its resistance to different kinds of attacks. Here I
will list different kinds of attack and explain how Ugarit can deal
with them, and how you need to configure it to gain that
protection.

## Archive snoopers

This might be somebody who can intercept Ugarit's communication with
the archive at any point, or who can read the archive itself at their
leisure.

### Reading your data

If you enable encryption, then all the blocks sent to the archive are
encrypted using a secret key stored in your Ugarit configuration
file. As long as that configuration file is kept safe, and the AES
algorithm is secure, then attackers who can snoop the archive cannot
decode your data blocks. Enabling compression will also help, as the
blocks are compressed before encrypting, which is thought to make
cryptographic analysis harder.

Recommendations: Use compression and encryption when there is a risk
of archive snooping. Keep your Ugarit configuration file safe using
UNIX file permissions (make it readable only by root), and maybe store
it on a removable device that's only plugged in when
required. Alternatively, use the "prompt" passphrase option, and be
prompted for a passphrase every time you run Ugarit, so it isn't
stored on disk anywhere.

### Looking for known hashes

A block is identified by the hash of its content (before compression
and encryption). If an attacker was trying to find people who own a
particular file (perhaps a piece of subversive literature), they could
search Ugarit archives for its hash.

However, Ugarit has the option to "key" the hash with a "salt" stored
in the Ugarit configuration file. This means that the hashes used are
actually a hash of the block's contents *and* the salt you supply. If
you do this with a random salt that you keep secret, then attackers
can't check your archive for known content just by comparing the hashes.

Recommendations: Provide a secret string to your hash function in your
Ugarit configuration file. Keep the Ugarit configuration file safe, as
per the advice in the previous point.

## Archive modifiers

These folks can modify Ugarit's writes into the archive, its reads
back from the archive, or can modify the archive itself at their leisure.

673Modifying an encrypted block without knowing the encryption key can at
674worst be a denial of service, corrupting the block in an unknown
675way. An attacker who knows the encryption key could replace a block
676with valid-seeming but incorrect content. In the worst case, this
677could exploit a bug in the decompression engine, causing a crash or
678even an exploit of the Ugarit process itself (thereby gaining the
679powers of a process inspector, as documented below). We can but hope
680that the decompression engine is robust. Exploits of the decryption
681engine, or other parts of Ugarit, are less likely due to the nature of
682the operations performed upon them.
However, if a block is modified, then when Ugarit reads it back, the
hash will no longer match the hash Ugarit requested, which will be
detected and an error reported. The hash is checked after
decryption and decompression, so this check does not protect us
against exploits of the decompression engine.
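That read-path check amounts to re-hashing what comes back and
comparing it to the key that was requested. A toy Python stand-in for
the real archive (the dict-based store, `put`/`get` names, and SHA-256
are all assumptions for illustration; the real path also decrypts and
decompresses before hashing):

```python
import hashlib

class TamperError(Exception):
    pass

store = {}  # hypothetical archive: key -> stored block

def put(content: bytes) -> str:
    key = hashlib.sha256(content).hexdigest()
    store[key] = content        # real Ugarit would compress+encrypt here
    return key

def get(key: str) -> bytes:
    content = store[key]        # ...and decrypt+decompress here
    # Re-hash after retrieval: a modified block no longer matches the
    # hash we asked for, so tampering is detected and reported.
    if hashlib.sha256(content).hexdigest() != key:
        raise TamperError("block %s failed its consistency check" % key)
    return content
```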
This protection is only afforded when the hash Ugarit asks for is not
tampered with. Most hashes are obtained from within other blocks,
which are therefore safe unless that block has been tampered with; the
nature of the hash tree conveys the trust in the hashes up to the
root. The root hashes are stored in the archive as "tags", which an
archive modifier could alter at will. Therefore, the tags cannot be
trusted if somebody might modify the archive. This is why Ugarit
prints out the snapshot hash and the root directory hash after
performing a snapshot, so you can record them securely outside of the
archive.

The most likely threat posed by archive modifiers is that they could
simply corrupt or delete all of your archive, without needing to know
any encryption keys.
Recommendations: Secure your archives against modifiers, by whatever
means possible. If archive modifiers are still a potential threat,
write down a log of your root directory hashes from each snapshot, and keep
it safe. When extracting your backups, use the `ls -ll` command in the
interface to check the "contents" hash of your snapshots, and check
that they match the root directory hash you expect.
## Process inspectors

These folks can attach debuggers or similar tools to running
processes, such as Ugarit itself.

Ugarit backend processes only see encrypted data, so people who can
attach to such a process gain the powers of archive snoopers and
modifiers, and the same conditions apply.

People who can attach to the Ugarit process itself, however, will see
the original unencrypted content of your filesystem, and will have
full access to the encryption keys and hashing keys stored in your
Ugarit configuration. When Ugarit is running with sufficient
permissions to restore backups, they will be able to intercept and
modify the data as it comes out, and probably gain total write access
to your entire filesystem in the process.
Recommendations: Ensure that Ugarit does not run under the same user
ID as untrusted software. In many cases it will need to run as root in
order to gain unfettered access to read the filesystems it is backing
up, or to restore the ownership of files. However, when all the files
it backs up are world-readable, it could run as an untrusted user for
backups, and where file ownership is trivially reconstructible, it can
do restores as a limited user, too.
## Attackers in the source filesystem

These folks create files that Ugarit will back up one day. By having
write access to your filesystem, they already have some level of
power, and standard Unix security practices such as storage quotas
should be used to control them. They may be people with logins on your
box, or more subtly, people who can cause servers to write files;
somebody who sends an email to your mailserver will probably cause
that message to be written to queue files, as will people who can
upload files via any means.
Such attackers might use up your available storage by creating large
files. This creates a problem in the actual filesystem, but that
problem can be fixed by deleting the files. If those files get
archived into Ugarit, then they are a part of that snapshot. If you
are using a backend that supports deletion, then (when I implement
snapshot deletion in the user interface) you could delete that entire
snapshot to recover the wasted space, but that is a rather serious
operation.
More insidiously, such attackers might attempt to abuse a hash
collision in order to fool the archive. If they have a way of creating
a file that, for instance, has the same hash as your shadow password
file, then Ugarit will think that it already has that file when it
attempts to snapshot it, and store a reference to the existing
file. If that snapshot is restored, then they will receive a copy of
your shadow password file. Similarly, if they can predict a future
hash of your shadow password file, and create a shadow password file
of their own (perhaps one giving them a root account with a known
password) with that hash, they can then wait for the real shadow
password file to have that hash. If the system is later restored from
that snapshot, then their chosen content will appear in the shadow
password file. However, doing this requires a very fundamental break
of the hash function being used.
Recommendations: Think carefully about who has write access to your
filesystems, directly or indirectly via a network service that stores
received data to disk. Enforce quotas where appropriate, and consider
not backing up "queue directories" where untrusted content might
appear; migrate incoming content that passes acceptance tests to an
area that is backed up. If necessary, the queue might be backed up to
a non-snapshotting system, such as rsyncing to another server, so that
any excessive files that appear in there are removed from the backup
in due course, while still affording protection.
# Future Directions

Here's a list of planned developments, in approximate priority order:

## General

* More checks with `double-check` mode activated. Perhaps read blocks
  back from the archive to check they match the blocks sent, to detect
  hash collisions. Maybe have levels of double-check-ness.
* Everywhere I use `(sql ...)` to create an sqlite prepared statement,
  don't. Create them all up-front and reuse the resulting statement
  objects; it'll save memory and time.

* Migrate the source repo to Fossil (when there's a
  migration to Fossil), and update the egg locations thingy.
## Backends

* Look at - can this help?

* Extend the backend protocol with a special "admin" command that
  allows for arbitrary backend-specific operations, and write an
  ugarit-backend-admin CLI tool to administer backends with it. The
  input should be a single s-expression as a list, and the result
  should be an alist which is displayed to the user in a friendly
  manner, as "Key: Value\n" lines.
* Implement "info" admin commands for all backends, that list any
  available stats, and at least the backend type and parameters.

* Support for recreating the index and tags on a backend-splitlog if
  they get corrupted, from the headers left in the log, as a "reindex"
  admin command.

* Support for flushing the cache on a backend-cache, via an admin
  command.

* Support for unlinking in backend-splitlog, by marking byte ranges as
  unused in the metadata (and by touching the headers in the log so we
  maintain the invariant that the metadata is a reconstructible cache)
  and removing the entries for the unlinked blocks. Perhaps provide an
  option to attempt to re-use existing holes to put blocks in for
  online reuse, and provide an offline compaction operation. Keep
  stats in the index of how many byte ranges are unused, and how many
  bytes are unused, in each file, and report them in the info admin
  interface, along with the option to compact any or all files.

* Have read-only and unlinkable config flags in the backend-splitlog
  metadata file, settable via admin commands.

* Optional support in backends for keeping a log of tag changes, and
  admin commands to read the log.
* Support for SFTP as a storage backend. Store one file per block, as
  per `backend-fs`, but remotely. See the SFTP
  protocol specs; popen an `ssh -s sftp` connection to the server then
  talk that simple binary protocol. Tada!

* Support for S3 as a storage backend. There is now an S3 egg!
* Support for replicated archives. This will involve a special storage
  backend that can wrap any number of other archives, each tagged with
  a trust percentage and read and write load weightings. Each block
  will be uploaded to enough archives to make the total trust be at
  least 100%, by randomly picking the archives weighted by their write
  load weighting. A read-only archive automatically gets its write
  load weighting set to zero, and a warning issued if it was
  configured otherwise. A local cache will be kept of which backends
  carry which blocks, and reads will be serviced by picking the
  archive that carries the block and has the highest read load weighting. If
  that archive is unavailable or has lost the block, then the others will be
  tried in read load order; and if none of them have it, an exhaustive
  search of all available archives will be performed before giving up,
  and the cache updated with the results if the block is found. In
  order to correctly handle archives that were unavailable during
  this, we might need to log an "unknown" for that block key / archive
  pair, rather than assuming the block is not there, and check it
  later. Users will be given an admin command to notify the backend of
  an archive going missing forever, which will cause it to be removed
  from the cache. Affected blocks should be examined and re-replicated
  if their replication count is now too low. Another command should be
  available to warn of impending deliberate removal, which will again
  remove the archive from the cluster and re-replicate, the difference
  being that the disappearing archive is usable for re-replicating
  FROM, so this is a safe operation for blocks that are only on that
  one archive. The individual physical archives that we put
  replication on top of won't be "valid" archives unless they are 100%
  replicated, as they'll contain references to blocks that are on
  other archives. It might be a good idea to mark them as such with a
  special tag to avoid people trying to restore directly from
  them. A copy of the replication configuration could be stored
  under a special tag to mark this fact, and to enable easy finding of
  the proper replicated archive to work from. There should be a
  configurable option to snapshot the cache to the archives whenever
  the replicated archive is closed, too. The command line to the
  backend, "backend-replicated", should point to an sqlite file for
  the configuration and cache, and users should use admin commands to
  add/remove/modify archives in the cluster.
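The block-placement rule described above - upload to randomly chosen
writable archives, weighted by write load, until the total trust
reaches 100% - could be sketched like this. The function and field
names are my own illustration of the planned design, not an
implemented API:

```python
import random

def choose_targets(archives, rng=random):
    """Pick archives to hold a block until their trust sums to >= 100%.

    archives: list of dicts with "name", "trust" (a percentage), and
    "write_weight" (0 for read-only archives, which are never picked).
    """
    candidates = [a for a in archives if a["write_weight"] > 0]
    chosen, trust = [], 0
    while trust < 100:
        if not candidates:
            raise RuntimeError("not enough writable trust to reach 100%")
        # Random pick weighted by write load weighting.
        weights = [a["write_weight"] for a in candidates]
        pick = rng.choices(candidates, weights=weights)[0]
        candidates.remove(pick)  # don't upload twice to the same archive
        chosen.append(pick)
        trust += pick["trust"]
    return chosen
```

A fully trusted archive (trust 100%) satisfies the rule on its own;
two half-trusted archives would both receive the block.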
## Core

* API documentation for the units we export.

* More `.ugarit` actions. Right now we just have exclude and include;
  we might specify less-safe operations such as commands to run before
  and after snapshotting certain subtrees, or filters (don't send this
  SVN repository; instead send the output of `svnadmin dump`),
  etc. Running arbitrary commands is a security risk if random users
  write their own `.ugarit` files - so we'd need some trust-based
  mechanism; they'd need to be explicitly enabled in `ugarit.conf`,
  then a `.ugarit` option could disable all unsafe operations in a
  subtree.
* Support for FFS flags, Mac OS X extended filesystem attributes, NTFS
  ACLs/streams, FAT attributes, etc... Ben says to look at Box Backup
  for some code to do that sort of thing.

* Implement `lock-tag!` etc. in backend-fs, as a precaution against two
  concurrent snapshots racing over updating the tag, where concurrent
  access to the archive is even possible.

* Deletion support - letting you remove snapshots. Perhaps you might
  want to remove all snapshots older than a given number of days on a
  given tag. Or just remove X out of Y snapshots older than a given
  number of days on a given tag. We have the core support for this;
  just find a snapshot and `unlink-directory!` it, leaving a dangling
  pointer from the snapshot, and write the snapshot handling code to
  expect this. Again, check Box Backup for that.
* Some kind of accounting for storage usage by snapshot. It'd be nice
  to track, as we write a snapshot to the archive, how many bytes we
  reuse and how many we back up. We can then store this in the
  snapshot metadata, and so report them somewhere. The blocks uploaded
  by a snapshot may well then be reused by other snapshots later on,
  so it wouldn't be a true measure of 'unique storage', nor a measure
  of what you'd reclaim by deleting that snapshot, but it'd be
  interesting anyway.
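Tracking this could be as simple as having the block-store operation
report whether the block was already present. A minimal sketch,
assuming a dict-backed store and SHA-256 keys purely for illustration:

```python
import hashlib

class AccountingStore:
    """Counts bytes freshly uploaded vs. reused from earlier snapshots."""

    def __init__(self):
        self.blocks = {}
        self.uploaded = 0  # bytes actually sent to the archive
        self.reused = 0    # bytes satisfied by existing blocks

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        if key in self.blocks:
            self.reused += len(content)    # already archived: just reuse
        else:
            self.blocks[key] = content
            self.uploaded += len(content)  # new block: a real upload
        return key
```

The two counters at the end of a snapshot are exactly the "bytes
reused" and "bytes backed up" figures the bullet above asks for.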
* Option, when backing up, to not cross mountpoints.

* Option, when backing up, to store the inode number and mountpoint path
  in directory entries, and then when extracting, keep a dictionary
  of this unique identifier to pathname, so that if a file to be
  extracted is already in the dictionary and the hash is the same, a
  hardlink can be created.
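That hardlink reconstruction might look like the following sketch,
keyed on the (mountpoint, inode) pair recorded at snapshot time. The
entry format here is hypothetical, invented for the example:

```python
import os

def restore(entries, dest):
    """Restore files, recreating hardlinks between entries that shared
    a (mountpoint, inode) identity - and content hash - when archived."""
    seen = {}  # (mountpoint, inode) -> (first restored path, hash)
    for e in entries:
        target = os.path.join(dest, e["name"])
        ident = (e["mount"], e["inode"])
        if ident in seen and seen[ident][1] == e["hash"]:
            # Same original file already restored: recreate the hardlink
            # instead of writing a second independent copy.
            os.link(seen[ident][0], target)
        else:
            with open(target, "wb") as f:
                f.write(e["data"])
            seen[ident] = (target, e["hash"])
```

Checking the hash as well as the inode guards against the inode having
been recycled for different content between snapshots.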
* Archival mode as well as snapshot mode. Whereas a snapshot record
  takes a filesystem tree and adds it to a chain of snapshots of the
  same filesystem tree, archival mode takes a filesystem tree and
  inserts it into a search tree anchored on the specified tag,
  indexing it on a list of key+value properties supplied at archival
  time. An archive tag is represented in the virtual filesystem as a
  directory full of archive objects, each identified by their full
  hash; each archive object references the filesystem root as well as
  the key+value properties, and optionally a parent link like a
  snapshot, as an archive can be made that explicitly replaces an
  earlier one and should replace it in the index; there is also a
  virtual directory for each indexed property which contains a
  directory for each value of the property, full of symlinks to the
  archive objects, and subdirectories that allow multi-property
  searches on other properties. The index itself is stored as a B-Tree
  with a reasonably small block size; when it's updated, the modified
  index blocks are replaced, thereby gaining new hashes, so their
  parents need replacing, all the way up the tree until a new root
  block is created. The existing block unlink mechanism in the
  backends will reclaim storage for blocks that are superseded, if the
  backend supports it. When this is done, ugarit will offer the option
  of snapshotting to a snapshot tag, or archiving to an archive tag,
  or archiving to an archive tag while replacing a specified archive
  object (nominated by path within the tag), which causes it to be
  removed from the index (except from the directory listing all
  archives by hash), and the new archive object is inserted,
  referencing the old one as a parent.
* Dump/restore format. On a dump, walk an arbitrary subtree of an
  archive, serialising objects. Do not put any hashes in the dump
  format - dump out entire files, and just identify objects with
  sequential numbers when forming the directory / snapshot trees. On a
  restore, read the same format and slide it into an archive (creating
  any required top-level snapshot objects if the dump doesn't start
  from a snapshot), putting it onto a specified tag. The
  intention is that this format can be used to migrate your stuff
  between archives, perhaps to change to a better backend.
## Front-end

* Better error messages.

* Line editing in the "explore" CLI, ideally with tab completion.

* API mode: works something like the backend API, except at the
  archive level. Supports all the important archive operations, plus
  access to sexpr stream writers and key stream writers,
  archive-node-fold, etc. Requested by andyjpb; perhaps I can write
  the framework for this and then let him add API functions as he desires.

* Command-line support to extract the contents of a given path in the
  archive, rather than needing to use explore mode. Also the option to
  extract given just a block key (useful when reading from keys logged
  manually at snapshot time, or from a backend that has a tag log).

* FUSE support. Mount it as a read-only filesystem :-D Then consider
  adding Fossil-style writing to the `current` of a snapshot, with
  copy-on-write of blocks to a buffer area on the local disk, then the
  option to make a snapshot of `current`.
* Filesystem watching. Even with the hash-caching trick, a snapshot
  will still involve walking the entire directory tree and looking up
  every file in the hash cache. We can do better than that - some
  platforms provide an interface for receiving real-time notifications
  of changed or added files. Using this, we could allow ugarit to run
  in continuous mode, keeping a log of file notifications from the OS
  while it does an initial full snapshot. It can then wait for a
  specified period (one hour, perhaps?), accumulating names of files
  changed since it started, before then creating a new snapshot by
  uploading just the files it knows to have changed, while subsequent
  file change notifications go to a new list.
## Testing

* An option to verify a snapshot, walking every block in it, checking
  that there are no dangling references and that everything matches its
  hash, without needing to put it into a filesystem, and applying any
  other sanity checks we can think of en route. Optionally compare it
  to an on-disk filesystem, while we're at it.
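Such a verifier is essentially a recursive walk over the hash tree. A
toy sketch, with a dict-based store and a simplified node format (the
key hashes only the node's data, not its children - a simplification
for illustration, not Ugarit's real block layout):

```python
import hashlib

def verify(store, key):
    """Check that `key` exists, hashes correctly, and that everything it
    references does too. Nodes are (data, [child keys]) tuples here."""
    if key not in store:
        raise ValueError("dangling reference: " + key)
    data, children = store[key]
    if hashlib.sha256(data).hexdigest() != key:
        raise ValueError("hash mismatch at " + key)
    for child in children:
        verify(store, child)  # recurse down the tree
```

Starting the walk from a snapshot's root hash visits every reachable
block exactly as an extraction would, without writing anything out.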
* A unit test script around the `ugarit` command-line tool; the corpus
  should contain a mix of tiny and huge files and directories, awkward
  cases for sharing of blocks (many identical files in the same dir,
  etc), complex forms of file metadata, and so on. It should archive
  and restore the corpus several times over with each hash,
  compression, and encryption option.
# Acknowledgements

The original idea came from Venti, a content-addressed storage system
from Plan 9. Venti is usable directly by user applications, and is
also integrated with the Fossil filesystem to support snapshotting the
status of a Fossil filesystem. Fossil allows references to either be
to a block number on the Fossil partition or to a Venti key; so when a
filesystem has been snapshotted, all it now contains is a "root
directory" pointer into the Venti archive, and any files modified
thereafter are copied-on-write into Fossil where they may be modified
until the next snapshot.

We're nowhere near that exciting yet, but using FUSE, we might be able
to do something similar, which might be fun. However, Venti inspired
me when I read about it years ago; it showed me how elegant
content-addressed storage is. Finding out that the Git version control
system used the same basic tricks really just confirmed this for me.
Also, I'd like to tip my hat to Duplicity. With the changing economics
of storage presented by services like Amazon S3, I
looked to Duplicity as it provided both SFTP and S3 backends. However,
it worked in terms of full and incremental backups, a model that I
think made sense for magnetic tapes, but loses out to
content-addressed snapshots when you have random-access
media. Duplicity inspired me by its adoption of multiple backends, the
very backends I want to use, but I still hungered for a
content-addressed snapshot store.
I'd also like to tip my hat to Box Backup. I've only used it a little,
because it requires a special server to manage the storage (and I want
to get my backups *off* of my servers), but it also inspires me with
directions I'd like to take Ugarit. It's much more aware of real-time
access to random-access storage than Duplicity, and has a very
interesting continuous background incremental backup mode, moving away
from the tape-based paradigm of backups as something you do on a
special day of the week, like some kind of religious observance. I
hope the author Ben, who is a good friend of mine, won't mind me
plundering his source code for details on how to request real-time
notification of changes from the filesystem, and how to read and write
extended attributes!
Moving on from the world of backup, I'd like to thank the Chicken Team
for producing Chicken Scheme. Felix and the community at #chicken on
Freenode have particularly inspired me with their can-do attitudes to
combining programming-language elegance and pragmatic engineering -
two things many would think un-unitable enemies. Of course, they
didn't do it all themselves - R5RS Scheme and the SRFIs provided a
solid foundation to build on, and there's a cast of many more in the
Chicken community, working on other bits of Chicken or just egging
everyone on. And I can't not thank Henry Baker for writing the seminal
paper on the technique Chicken uses to implement full tail-calling
Scheme with cheap continuations on top of C; Henry already had my
admiration for his work on combining elegance and pragmatism in linear
logic. Why doesn't he return my calls? I even sent flowers.
A special thanks should go to Christian Kellermann for porting Ugarit
to use Chicken 4 modules, too, which was otherwise a big bottleneck to
development, as I was stuck on Chicken 3 for some time!

Thanks to the early adopters who brought me useful feedback, too!

And I'd like to thank my wife for putting up with me spending several
evenings and weekends and holiday days working on this thing...
# Version history

* 1.1: Consistency check on read blocks by default. Removed warning
  about deletions from backend-cache; we need a new mechanism to report
  warnings from backends.

* 1.0: Migrated from gdbm to sqlite for metadata storage, removing the
  GPL taint. Unit test suite. backend-cache made into a separate
  backend binary. Removed backend-log. BUGFIX: file caching uses mtime *and*
  size now, rather than just mtime. Error handling so we skip objects
  that we cannot do something with, and proceed to try the rest of the
  operation.

* 0.8: Decoupled backends from the core and into separate binaries,
  accessed via standard input and output, so they can be run over SSH
  tunnels and other such magic.

* 0.7: File cache support, sorting of directories so they're archived
  in canonical order, autoloading of hash/encryption/compression
  modules so they're not required dependencies any more.

* 0.6: `.ugarit` support.

* 0.5: Keyed hashing so attackers can't tell what blocks you have,
  markers in logs so the index can be reconstructed, sha2 support, and
  passphrase support.

* 0.4: AES encryption.

* 0.3: Added splitlog backend, and fixed a .meta file typo.

* 0.2: Initial public release.

* 0.1: Internal development release.