Changeset 25479 in project for release/4/ugarit/trunk/README.txt


Timestamp: 11/07/11 10:43:08 (10 years ago)
Author:    Alaric Snell-Pym
Message:   ugarit: Dotting is, crossing ts...
File:      1 edited

  • release/4/ugarit/trunk/README.txt

Legend:

  ' ' Unmodified
  '+' Added
  '-' Removed

--- release/4/ugarit/trunk/README.txt (r25478)
+++ release/4/ugarit/trunk/README.txt (r25479)
 You can then refer to it using the following archive identifier:
 
-      fs "...path to directory..."
-
-### New Logfile backend
-
-The logfile backend works much like the original Venti system. It's append-only - you won't be able to delete old snapshots from a logfile archive, even when I implement deletion. It stores the archive in two sets of files; one is a log of data blocks, split at a specified maximum size, and the other is the metadata: a GDBM file used as an index to locate blocks in the logfiles and to store the blocks' types, a GDBM file of tags, and a counter file used in naming logfiles.
-
-To set up a new logfile archive, just choose where to put the two sets of files. It would be nice to put the metadata on a different physical disk to the logs, to reduce seeking. Create a directory for each, or if you only have one disk, you can put them all in the same directory.
+      "backend-fs fs ...path to directory..."
+
+### Logfile backend
+
+The logfile backend works much like the original Venti system. It's
+append-only - you won't be able to delete old snapshots from a logfile
+archive, even when I implement deletion. It stores the archive in two
+sets of files; one is a log of data blocks, split at a specified
+maximum size, and the other is the metadata: an sqlite database used
+to track the location of blocks in the log files, the contents of
+tags, and a count of the logs so a filename can be chosen for a new one.
+
+To set up a new logfile archive, just choose where to put the two
+parts. It would be nice to put the metadata file on a different
+physical disk to the logs directory, to reduce seeking. If you only
+have one disk, you can put the metadata file in the log directory
+("metadata" is a good name).
 
 You can then refer to it using the following archive identifier:
 
-      splitlog "...log directory..." "...metadata directory..." max-logfile-size
+      "backend-fs splitlog ...log directory... ...metadata file... max-logfile-size"
 
 For most platforms, a max-logfile-size of 900000000 (900 MB) should suffice. For now, don't go much bigger than that on 32-bit systems until Chicken's `file-position` function is fixed to work with files >1GB in size.
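For illustration only (the paths here are invented, not defaults), a filesystem archive stored under /srv/ugarit, and a splitlog archive keeping its log directory and metadata file on separate disks, would use identifiers along these lines:

      "backend-fs fs /srv/ugarit"
      "backend-fs splitlog /disk1/ugarit-logs /disk2/ugarit-metadata 900000000"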
…
 The hash line chooses a hash algorithm. Currently Tiger-192 (`tiger`), SHA-256 (`sha256`), SHA-384 (`sha384`) and SHA-512 (`sha512`) are supported; if you omit the line then Tiger will still be used, but it will be a simple hash of the block with the block type appended, which reveals to attackers what blocks you have (as the hash is of the unencrypted block, and the hash is not encrypted). This is useful for development and testing or for use with trusted archives, but not advised for use with archives that attackers may snoop at. Providing a secret string produces a hash function that hashes the block, the type of block, and the secret string, producing hashes that attackers who can snoop the archive cannot use to find known blocks. Whichever hash function you use, you will need to install the required Chicken egg with one of the following commands:
 
-    sudo chicken-install tiger-hash  # for tiger
-    sudo chicken-install sha2        # for the SHA hashes
+    chicken-install -s tiger-hash  # for tiger
+    chicken-install -s sha2        # for the SHA hashes
 
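As a sketch of the keyed form described above (the secret string below is a made-up example; choose your own and keep it safe), the hash line in the configuration might read:

      (hash tiger "some-secret-string")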
 `lzma` is the recommended compression option for low-bandwidth backends or when space is tight, but it's very slow to compress; deflate or no compression at all are better for fast local archives. To have no compression at all, just remove the `(compression ...)` line entirely. Likewise, to use compression, you need to install a Chicken egg:
 
-       sudo chicken-install z3       # for deflate
-       sudo chicken-install lzma     # for lzma
+       chicken-install -s z3       # for deflate
+       chicken-install -s lzma     # for lzma
 
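As a sketch (the option tokens are assumed to match the algorithm names used above, and the corresponding egg must be installed), the compression line might be either of:

      (compression lzma)      ; smallest archive, slow to compress
      (compression deflate)   ; faster, good for fast local archives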
 Likewise, the `(encryption ...)` line may be omitted to have no encryption; the only currently supported algorithm is aes (in CBC mode) with a key given in hex, as a passphrase (hashed to get a key), or a passphrase read from the terminal on every run. The key may be 16, 24, or 32 bytes for 128-bit, 192-bit or 256-bit AES. To specify a hex key, just supply it as a string, like so:
 
       (encryption aes "00112233445566778899AABBCCDDEEFF")
-
+
 ...for 128-bit AES,
 
…
 Again, as it is an optional feature, to use encryption, you must install the appropriate Chicken egg:
 
-       sudo chicken-install aes
-
-A file cache, if enabled, significantly speeds up subsequent snapshots of a filesystem tree. The file cache is a file (which Ugarit will create if it doesn't already exist) mapping filenames to (mtime,hash) pairs; as it scans the filesystem, if it finds a file in the cache and the mtime has not changed, it will assume it is already archived under the specified hash. This saves it from having to read the entire file to hash it and then check if the hash is present in the archive. In other words, if only a few files have changed since the last snapshot, then snapshotting a directory tree becomes an O(N) operation, where N is the number of files, rather than an O(M) operation, where M is the total size of files involved.
+       chicken-install -s aes
+
+A file cache, if enabled, significantly speeds up subsequent snapshots
+of a filesystem tree. The file cache is a file (which Ugarit will
+create if it doesn't already exist) mapping filenames to
+(mtime,hash,size) tuples; as it scans the filesystem, if it finds a
+file in the cache and the mtime and size have not changed, it will
+assume it is already archived under the specified hash. This saves it
+from having to read the entire file to hash it and then check if the
+hash is present in the archive. In other words, if only a few files
+have changed since the last snapshot, then snapshotting a directory
+tree becomes an O(N) operation, where N is the number of files, rather
+than an O(M) operation, where M is the total size of files involved.
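A sketch of enabling the cache (assuming it is declared as a configuration entry in the same s-expression style as the lines above; the path is invented):

      (file-cache "/var/ugarit/file-cache")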
 
 For example:
…
 Be careful to put a set of parentheses around each configuration entry. White space isn't significant, so feel free to indent things and wrap them over lines if you want.
 
-Keep copies of this file safe - you'll need it to do extractions! Print a copy out and lock it in your fire safe! Ok, currently, you might be able to recreate it if you remember where you put the storage, but when I add the `(encryption ...)` option, there will be an encryption key to deal with as well.
+Keep copies of this file safe - you'll need it to do extractions!
+Print a copy out and lock it in your fire safe! Ok, currently, you
+might be able to recreate it if you remember where you put the
+storage, but encryption keys are harder to remember.
 
 ## Your first backup
…
 Here's a list of planned developments, in approximate priority order:
 
+## General
+
+* Everywhere I use (sql ...) to create an sqlite prepared statement,
+  don't. Create them all up-front and reuse the resulting statement
+  objects, it'll save memory and time.
+
+* Migrate the source repo to Fossil (when there's a
+  kitten-technologies.co.uk migration to Fossil), and update the egg
+  locations thingy.
+
 ## Backends
 
-* Eradicate all GPL taint from gdbm by using sqlite for storing
-  metadata in backends!
-
-* Remove backend-log. Have just backend-fs, backend-splitlog, and
-  maybe a backend-sqlite for everything-in-sqlite storage (plus future
-  S3/SFTP backends). Not including meta-backends such as backend-cache
-  and backend-replicated.
-
-* Support for recreating the index and tags on a backend-log or
-  backend-splitlog if they get corrupted, from the headers left in the
-  log. Do this by extending the backend protocol with a special
-  "admin" command that allows for arbitrary backend-specific
-  operations, and write an ugarit-backend-admin CLI tool to administer
-  backends with it.
+* Support for recreating the index and tags on a backend-splitlog if
+  they get corrupted, from the headers left in the log. Do this by
+  extending the backend protocol with a special "admin" command that
+  allows for arbitrary backend-specific operations, and write an
+  ugarit-backend-admin CLI tool to administer backends with it.
 
 * Support for unlinking in backend-splitlog, by marking byte ranges as
…
 ## Core
 
-* Eradicate all GPL taint from gdbm by using sqlite for storing
-  the mtime cache!
-
-* Better error handling. Right now we give up if we can't read a file
-  or directory. It would be awesomer to print a warning but continue
-  to archive everything else.
+* API documentation for the units we export
 
 * More `.ugarit` actions. Right now we just have exclude and include;
…
 * Better error messages
 
+* Line editing in the "explore" CLI, ideally with tab completion
+
+* API mode: Works something like the backend API, except at the
+  archive level. Supports all the important archive operations, plus
+  access to sexpr stream writers and key stream writers,
+  archive-node-fold, etc. Requested by andyjpb, perhaps I can write
+  the framework for this and then let him add API functions as he desires.
+
 * FUSE support. Mount it as a read-only filesystem :-D Then consider
   adding Fossil-style writing to the `current` of a snapshot, with
…
 * Filesystem watching. Even with the hash-caching trick, a snapshot
   will still involve walking the entire directory tree and looking up
-  every file in the hash cash. We can do better than that - some
+  every file in the hash cache. We can do better than that - some
   platforms provide an interface for receiving real-time notifications
   of changed or added files. Using this, we could allow ugarit to run
…
   to an on-disk filesystem, while we're at it.
 
-* A more formal test corpus with a unit test script around the
-  `ugarit` command-line tool; the corpus should contain a mix of tiny
-  and huge files and directories, awkward cases for sharing of blocks
-  (many identical files in the same dir, etc), complex forms of file
-  metadata, and so on. It should archive and restore the corpus
-  several times over with each hash, compression, and encryption
-  option.
+* A unit test script around the `ugarit` command-line tool; the corpus
+  should contain a mix of tiny and huge files and directories, awkward
+  cases for sharing of blocks (many identical files in the same dir,
+  etc), complex forms of file metadata, and so on. It should archive
+  and restore the corpus several times over with each hash,
+  compression, and encryption option.
 
 # Acknowledgements
…
 
 Moving on from the world of backup, I'd like to thank the Chicken Team
-for producing Chicken Scheme. Felix, Peter, Elf, and Alex have
-particularly inspired me with their can-do attitudes to combining
-programming-language elegance and pragmatic engineering - two things
-many would think un-unitable enemies. Of course, they didn't do it all
-themselves - R5RS Scheme and the SRFIs provided a solid foundation to
-build on, and there's a cast of many more in the Chicken community,
-working on other bits of Chicken or just egging everyone on. And I
-can't not thank Henry Baker for writing the seminal paper on the
-technique Chicken uses to implement full tail-calling Scheme with
-cheap continuations on top of C; Henry already had my admiration for
-his work on combining elegance and pragmatism in linear logic. Why
-doesn't he return my calls? I even sent flowers.
+for producing Chicken Scheme. Felix and the community at #chicken on
+Freenode have particularly inspired me with their can-do attitudes to
+combining programming-language elegance and pragmatic engineering -
+two things many would think un-unitable enemies. Of course, they
+didn't do it all themselves - R5RS Scheme and the SRFIs provided a
+solid foundation to build on, and there's a cast of many more in the
+Chicken community, working on other bits of Chicken or just egging
+everyone on. And I can't not thank Henry Baker for writing the seminal
+paper on the technique Chicken uses to implement full tail-calling
+Scheme with cheap continuations on top of C; Henry already had my
+admiration for his work on combining elegance and pragmatism in linear
+logic. Why doesn't he return my calls? I even sent flowers.
 
 A special thanks should go to Christian Kellermann for porting Ugarit
…
 # Version history
 
+* 1.0: Migrated from gdbm to sqlite for metadata storage, removing the
+  GPL taint. Unit test suite. backend-cache made into a separate
+  backend binary. Removed backend-log. BUGFIX: file caching uses mtime *and*
+  size now, rather than just mtime. Error handling so we skip objects
+  that we cannot do something with, and proceed to try the rest of the
+  operation.
+
 * 0.8: decoupling backends from the core and into separate binaries,
   accessed via standard input and output, so they can be run over SSH