Archive Team megawarc factory
=============================

Some scripts to bundle Archive Team uploads and upload them to Archive.org. Use at your own risk; the scripts will need per-project adjustment.

These scripts make batches of uploaded warc.gz files, combine them into megawarcs and upload them to their permanent home on Archive.org.

Three processes work together to make this happen; a fourth, the offloader, can run in place of or alongside the uploader:

1. The chunker
--------------

The chunker moves uploaded warc.gz files from the upload directory to a batch directory. When this directory has grown to 50GB, the chunker begins a new directory and moves the completed directory to the packing queue.

There can only be one chunker per upload directory. Chunking is fast as long as the files do not move to a different filesystem.

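
The batching rule above can be sketched roughly as follows. This is an illustration only, not the project's `chunk-multiple` script: the function name `chunk_once`, its arguments and the choice to close a batch as soon as it reaches the limit are all assumptions.

```shell
# Hypothetical sketch of the chunking step (not the real chunker).
# chunk_once UPLOAD_DIR CURRENT_DIR QUEUE_DIR MAX_BYTES
# Moves every *.warc.gz from UPLOAD_DIR into CURRENT_DIR. Each time CURRENT_DIR
# reaches MAX_BYTES it is moved to QUEUE_DIR under a timestamped name and a
# fresh, empty CURRENT_DIR is started.
chunk_once() {
  local upload_dir=$1 current_dir=$2 queue_dir=$3 max_bytes=$4
  local f size name total=0

  mkdir -p "$current_dir" "$queue_dir"

  # Count what an earlier run already left in the open batch.
  for f in "$current_dir"/*.warc.gz; do
    [ -e "$f" ] || continue
    size=$(( $(wc -c < "$f") ))
    total=$((total + size))
  done

  for f in "$upload_dir"/*.warc.gz; do
    [ -e "$f" ] || continue
    size=$(( $(wc -c < "$f") ))
    mv "$f" "$current_dir"/ || continue
    total=$((total + size))

    if [ "$total" -ge "$max_bytes" ]; then
      # Name the finished batch after the current time, e.g. 20130401213900.
      name=$(date +%Y%m%d%H%M%S)
      # Wait for a new second if a batch with this name already exists.
      while [ -e "$queue_dir/$name" ]; do
        sleep 1
        name=$(date +%Y%m%d%H%M%S)
      done
      mv "$current_dir" "$queue_dir/$name"
      mkdir -p "$current_dir"
      total=0
    fi
  done
}
```

With placeholder paths, a 50GB limit would be passed as `chunk_once /path/to/upload /path/to/current /path/to/packing-queue $((50 * 1024 * 1024 * 1024))`.
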
2. The packer
-------------

The packer monitors the packing queue. When the chunker brings a new directory, the packer removes the directory from the queue and starts converting it into a megawarc (using the megawarc utility). When that is done, the packer moves the megawarc to the upload queue and removes the original warc files.

If necessary, multiple packers can work the same queue. Packing involves lots of gzipping and takes some time.

3. The uploader
---------------

The uploader monitors the upload queue. When the packer brings a new megawarc, the uploader removes the megawarc from the queue and uploads it to Archive.org. If the upload is successful, the uploader removes the megawarc.

If necessary, multiple uploaders can work the same queue.

4. The offloader
----------------

The offloader also monitors the upload queue. Instead of uploading to Archive.org, it sends the megawarc to another host via rsync. This is useful when Archive.org has issues.

The offloader can run at the same time as the uploader without conflict.

Filesystems
-----------

From the chunker to the uploader, the chunks move through the system as timestamped directories (e.g., 20130401213900). This timestamp will also be used in the name of the uploaded item on Archive.org. The queues are directories. Processes 'claim' a chunk by moving it from the queue directory to their working directory. This assumes that `mv` is an atomic operation.

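
As a rough illustration of the claim step, a worker could try to `mv` each queued item into its own working directory and treat a failed `mv` as "someone else got there first". The function name `claim_item` and its arguments are hypothetical; the real scripts may do this differently.

```shell
# Hypothetical sketch of claiming one item from a queue directory.
# claim_item QUEUE_DIR WORK_DIR
# Prints the path of the claimed item and returns 0, or returns 1 when
# nothing in the queue could be claimed.
claim_item() {
  local queue_dir=$1 work_dir=$2 item name

  mkdir -p "$work_dir"

  for item in "$queue_dir"/*; do
    [ -e "$item" ] || continue
    name=$(basename "$item")
    # If another worker moved this item away first, mv fails because the
    # source no longer exists, and we try the next item instead.
    if mv "$item" "$work_dir/$name" 2>/dev/null; then
      printf '%s\n' "$work_dir/$name"
      return 0
    fi
  done

  return 1
}
```

The sketch relies on item names being unique (the timestamps) and on the queue and working directory sharing a filesystem, so that the move is a single rename rather than a copy followed by a delete.
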
For efficiency and to maintain the atomicity of `mv`, the filesystem layout of these directories is very important:

1. The Rsync upload directory, the chunker working directory, the packing queue and the warc.gz side of the packer's working directory should all be on the same filesystem. This ensures that the uploaded warc.gz files never move to a different filesystem.
2. The megawarc side of the packer's working directory, the upload queue and the uploader's working directory should also share a filesystem.

Filesystems 1 and 2 do not have to be the same.

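
Before starting the scripts, it may help to check that each group of directories really shares a filesystem. The sketch below is not part of the scripts: it compares the device and mount point that `df -P` reports for each directory, the helper names `fs_id` and `same_filesystem` are made up for this example, and it assumes mount points without spaces.

```shell
# Hypothetical helper: print "device mountpoint" for the filesystem holding $1.
fs_id() {
  df -P "$1" 2>/dev/null | awk 'NR == 2 { print $1, $6 }'
}

# same_filesystem DIR [DIR...]
# Returns 0 if every directory is on the same filesystem as the first one,
# 1 if any of them differs, and 2 if the first directory cannot be inspected.
same_filesystem() {
  local first other

  first=$(fs_id "$1")
  [ -n "$first" ] || return 2
  shift

  for other in "$@"; do
    [ "$(fs_id "$other")" = "$first" ] || return 1
  done

  return 0
}
```

For example, `same_filesystem /path/to/upload /path/to/chunker-work /path/to/packing-queue || echo "group 1 spans filesystems"`, where the paths are placeholders for the directories set in your configuration.
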
Configuration
-------------

Create a configuration file called `config.sh` and place it in the directory where you start the scripts. See `config.example.sh` for more details.

Running
-------

Run the scripts in `screen`, `tmux` or something similar. Create the `RUN` file with `touch RUN` before you start the scripts; remove it with `rm RUN` to stop them gracefully.

* `./chunk-multiple` (run exactly one)
* `./pack-multiple` (you may run more than one)
* `./upload-multiple` (you may run more than one)
* `./offload-multiple` (you may run more than one, can work in tandem with `upload-multiple`)

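
One plausible way to honor a `RUN` file is a loop that checks for it only between units of work, so that removing the file never interrupts an item in progress. The sketch below is an assumption about the general pattern, not a copy of the `*-multiple` scripts; `run_while_flag` and its arguments are invented for the example.

```shell
# Hypothetical sketch of a RUN-file controlled worker loop.
# run_while_flag RUN_FILE IDLE_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly for as long as RUN_FILE exists. When COMMAND
# returns non-zero (for example because the queue was empty), the loop
# sleeps IDLE_SECONDS before checking again.
run_while_flag() {
  local run_file=$1 idle=$2
  shift 2

  while [ -f "$run_file" ]; do
    "$@" || sleep "$idle"
  done
}
```

Because the flag is tested only at the top of the loop, `rm RUN` lets the current unit of work finish before the loop exits.
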
Utility scripts:

* `./du-all` will run `du -hs` in all queues

Scheduling priorities
---------------------

The packing script will use all your I/O capacity. Consider using `nice` and `ionice` to run it at a lower priority, so it doesn't hinder your incoming Rsync or outgoing curl uploads.

* `ionice -c 2 -n 6 nice -n 19 ./pack-multiple`

Recovering from errors
----------------------

The scripts are designed not to lose data. If a script dies, you can look in its working directory for in-progress items and move them back to the queue.

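
Moving items back by hand could look something like the sketch below. It is only an illustration: `requeue_items` and its arguments are hypothetical, it does not inspect the items, and it should only be used once the process that owned the working directory is no longer running. A half-packed item may need a manual check first.

```shell
# Hypothetical sketch: return in-progress items to their queue.
# requeue_items WORK_DIR QUEUE_DIR
# Moves every entry in WORK_DIR back to QUEUE_DIR, skipping any name that
# already exists in the queue, and prints the number of items moved.
requeue_items() {
  local work_dir=$1 queue_dir=$2 item name moved=0

  for item in "$work_dir"/*; do
    [ -e "$item" ] || continue
    name=$(basename "$item")
    if [ -e "$queue_dir/$name" ]; then
      echo "skipping $name: already present in queue" >&2
      continue
    fi
    mv "$item" "$queue_dir/$name" && moved=$((moved + 1))
  done

  echo "$moved"
}
```
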
Requirements
------------

These scripts use Bash and Curl.

You should clone https://github.com/ArchiveTeam/megawarc to the `megawarc/` subdirectory of these scripts. The megawarc utility requires Python and Gzip.