Initial commit

master
JustAnotherArchivist 1 year ago
commit 8d0068a573
3 changed files with 43 additions and 0 deletions
  1. Dockerfile (+3, -0)
  2. README.md (+6, -0)
  3. mercurial-dl (+34, -0)

Dockerfile (+3, -0)

@@ -0,0 +1,3 @@
FROM atdr.meo.ws/archiveteam/mercurial-grab@sha256:c25841efe3679952eef9418a9784b09b7668918f753227890a10d3e3c404b108
COPY mercurial-dl /grab/
ENTRYPOINT ["./mercurial-dl"]

README.md (+6, -0)

@@ -0,0 +1,6 @@
A small wrapper around the mercurial-grab project container that dumps individual repositories on demand.

docker build -t mercurial-dl:latest .
docker run --rm -v "$(pwd)/data:/grab/data" mercurial-dl:latest https://hg.mozilla.org/penelope/

Accepts any number of URLs as arguments. Each URL is dumped into its own directory under `data/`, and the output WARCs are *not* merged together. See <https://archive.org/details/hg.libsdl.org_clone_warc_20210211> for an example of what the output might look like (although the directory naming here is slightly different).
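
The directory naming mentioned above follows from the `${url//\//_}` substitution in `mercurial-dl`. The following is a minimal sketch of that mapping, not part of the commit: `item_dir_for` is a hypothetical helper, and the second URL is only an illustrative assumption.

```shell
#!/bin/bash
# Sketch: print the directory the wrapper would create for each URL.
# item_dir_for is a hypothetical helper; it does not exist in the repository.
item_dir_for() {
    local url="$1"
    # Same substitution as mercurial-dl: every "/" in the URL becomes "_".
    local name="${url//\//_}"
    printf './data/%s/\n' "${name}"
}

# The second URL is an assumed example, not taken from the README.
for url in https://hg.mozilla.org/penelope/ https://hg.libsdl.org/SDL
do
    item_dir_for "${url}"
done
```

Because the scheme and every path separator are kept (as underscores), two different URLs should not end up sharing a directory, which fits the README's note that the WARCs are kept separate.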

mercurial-dl (+34, -0)

@@ -0,0 +1,34 @@
#!/bin/bash
for url
do
    name="${url//\//_}"
    itemDir="./data/${name}/"
    mkdir -p "${itemDir}"
    warcName="${name}"
    item_dir="${itemDir}" item_value="${url}" warc_file_base="${warcName}" /grab/wget-at \
        -U 'mercurial/proto-1.0 (Mercurial 5.3.1)' \
        -nv \
        --no-cookies \
        --content-on-error \
        --lua-script mercurial.lua \
        -o "${itemDir}/wget.log" \
        --no-check-certificate \
        --output-document "${itemDir}/wget.tmp" \
        --truncate-output \
        -e robots=off \
        --rotate-dns \
        --recursive --level=inf \
        --no-parent \
        --page-requisites \
        --timeout 30 \
        --tries inf \
        --span-hosts \
        --waitretry 30 \
        --warc-file "${itemDir}/${warcName}-main" \
        --warc-header 'operator: Archive Team' \
        --warc-header 'mercurial-dld-script-version: 20201031.01' \
        --warc-dedup-url-agnostic \
        --warc-header "mercurial-repository: ${url}" \
        --warc-header 'warc-type: main' \
        "${url}?cmd=capabilities"
done
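
The script's `for url` loop has no `in` list, which in bash means it iterates over the positional parameters, so each command-line URL gets its own `wget-at` run. A small sketch of that idiom, printing only the first request URL the script would pass to `wget-at`: `first_requests` is a hypothetical function used for illustration, and nothing is downloaded.

```shell
#!/bin/bash
# Sketch: "for url" with no "in" list loops over the positional parameters.
# first_requests is a hypothetical helper; inside a function the loop sees
# the function's own arguments, just as the script sees its command line.
first_requests() {
    for url
    do
        # mercurial-dl appends this query string to form the seed URL.
        printf '%s?cmd=capabilities\n' "${url}"
    done
}

first_requests https://hg.mozilla.org/penelope/
```

Note that the query string is appended verbatim, so a URL passed without a trailing slash is requested without one as well.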
