The little things give you away... A collection of various small helper stuff
JustAnotherArchivist 81bba9a631 Add --size-hint option 4 weeks ago
.gitignore Add infrastructure for simple C-based tools 2 years ago
.make-and-exec Warnings are bad, mmkay? 11 months ago
.urldecode-test Get rid of Makefile for more control; add proper debug build support 11 months ago
.warc-dump-responses-test Add test for warc-dump-responses 4 months ago
.youtube-extract-rapid-test Get rid of Makefile for more control; add proper debug build support 11 months ago
LICENSE Initial commit 5 years ago
README.md Initial commit 5 years ago
alphabetseq Swap syntaxes 2 years ago
archivebot-blogspot Fix HTTPS handling 4 years ago
archivebot-compress-db Add archivebot-compress-db 3 months ago
archivebot-fix-queue-counters Fix TypeError 1 year ago
archivebot-high-resources Replace archivebot-high-memory with more capable archivebot-high-resources 3 months ago
archivebot-irccloud-paste Add archivebot-irccloud-paste 3 years ago
archivebot-jobid-calculation More snscrape helper tools 4 years ago
archivebot-jobs Not-so-new new ArchiveBot domain 1 year ago
archivebot-list-stuck-requests Fix line endings 5 years ago
archivebot-log-extract-ignores Add archivebot-log-extract-ignores 3 years ago
archivebot-monitor-job-queue First set of little things 5 years ago
archivebot-pipelines-count-jobs Add archivebot-pipelines-count-jobs 4 months ago
archivebot-youtube Add helper for AB/chromebot-ing YouTube channels and users 4 years ago
at-tracker-sample-user-item-size Add at-tracker-sample-user-item-size 2 years ago
azure-storage-list Add --jsonl option 2 years ago
b64grep Add b64grep 2 years ago
base64url Add base64url 2 years ago
bencode2json Add bencode2json 1 year ago
bing-scrape Fix extraction of search results 5 months ago
bugzilla-url-list Add Bugzilla URL list generator 2 years ago
cdx-chunk Add cdx-chunk 2 years ago
cloudflare-email-decode Add cloudflare-email-decode 1 year ago
combine-by-prefix Add combine-by-prefix 2 years ago
curl-ia Add header mode (e.g. for tasks API) 1 year ago
curl-ua Add IE6 UA 3 years ago
deb-repo-urls Fix deb file URLs 3 years ago
dedupe Another alternative and performance/memory comparison 3 years ago
dir-to-ia Add dir-to-ia 10 months ago
europarl-meps-collect Add script for scraping MEP links from europarl.europa.eu 4 years ago
extract-urls-for-archiveteam-projects Add wpull2-extract-ignored-offsite and extract-urls-for-archiveteam-projects 3 months ago
foolfuuka-search Better workaround for the 5000 results limit; works for FoolFuuka 2.0.1 and up 5 years ago
format-size Split out size formatting 4 years ago
fos-ftp-upload First set of little things 5 years ago
get-crx4chrome-urls First set of little things 5 years ago
github-list-repos Add support for redesigned org repo list 1 month ago
gitlab-list-repos Add support for other instances and full-instance listing 2 years ago
gofile.io-dl Add support for password-protected folders 2 years ago
html-extract-stupid Handle … and … 9 months ago
http-response-bodies Add http-response-bodies 1 year ago
http-response-bodies.c Fix extra LF between chunks 8 months ago
ia-cdx-search Fix error when no arguments are provided 1 year ago
ia-cdx-search-subdomains Fix URLs without a path 1 year ago
ia-derive Queue derives with `ia tasks` instead of this manual curl rubbish 2 years ago
ia-files-xml-to-jsonl Guarantee stable output order 2 years ago
ia-upload-progress Proper script for tracking size of uploaded data 4 years ago
ia-upload-stream Add --size-hint option 4 weeks ago
ia-verify-file Add a timeout to prevent potentially indefinite blocking 2 years ago
ia-wait-item-tasks Handle error tasks by exiting non-zero 10 months ago
iasha1check Fix output sometimes appearing after prompt 1 year ago
ix.io-upload Allow overriding the "remote filename" 5 years ago
kill-connections Handle processes with too many open connections 1 year ago
kill-wpull-connections Merge kill-wpull-connections repository into little-things 3 years ago
killcx-all-https First set of little things 5 years ago
mastodon-enumerate-users Enumerate users on a Mastodon instance 4 years ago
mastodon-outdated Finding outdated Mastodon instances 5 years ago
moinmoin-url-list Add moinmoin-url-list 3 months ago
parent-urls Refactor, strip query/fragment 3 years ago
pipelines-launch-in-tmux-windows First set of little things 5 years ago
pipelines-monitor-tmux-wget-outcomes Monitor how a pipeline's wget processes are faring 5 years ago
pipelines-stop-gracefully First set of little things 5 years ago
reddit-pushshift-search Add Bing, Reddit/Pushshift, and FoolFuuka scrapers 5 years ago
run-every-five-minutes First set of little things 5 years ago
s3-bucket-find-direct-url Add support for PermanentRedirect error responses 5 months ago
s3-bucket-list Enable line buffering on list URLs FD 6 months ago
s3-bucket-list-qwarc Add JSONL output option for S3 listing 2 years ago
snscrape-extract Add support for Twitter hashtag extraction 4 years ago
snscrape-facebook-user Silence by default 4 years ago
snscrape-instagram-user Silence by default 4 years ago
snscrape-prepare-commands Add support for Twitter hashtag extraction 4 years ago
snscrape-tmux Update tmux session commands 4 years ago
snscrape-twitter-filter Filter Twitter hashtag scrapes based on account scrapes 5 years ago
snscrape-twitter-hashtag Extract external links from Twitter 4 years ago
snscrape-twitter-user Extract external links from Twitter 4 years ago
snscrape-upload Print Instagram ignore immediately after upload instead of at the end 4 years ago
snscrape-vk-user Silence by default 4 years ago
snscrape-wiki-transfer-merge Helper tools for snscrape and the wiki pages 4 years ago
social-media-extract-profile-link Fix decoding of links on Facebook profiles 4 years ago
sum-sizes Avoid float roundtrip for integer values 4 months ago
tar-many-files-progress First set of little things 5 years ago
tcp-closer Add tcp-closer command 4 years ago
torrent-tiny Fix negative ints 1 year ago
transfer.archivete.am-upload Handle HTTP/2 lowercase headers 2 years ago
transfer.notkiska.pw-check-ia Switch to HTTPS 3 years ago
uniqify Add uniqify 4 years ago
uniqify-recent Add uniqify-recent 1 month ago
url-normalise Normalise domain name to lower-case before further processing 4 years ago
urldecode Add URL/percent decoding tool 2 years ago
urldecode.c Fix unused argc and argv error 10 months ago
urlsort Add urlsort 2 years ago
warc-dump-responses Add warc-dump-responses 1 year ago
warc-dump-responses.c Fix error when the terminating CRLFCRLF of a record is truncated 4 months ago
warc-peek Allow negative offsets to peek near the end of the file 1 year ago
warc-size Split out size formatting 4 years ago
warc-tiny Fix empty files being considered valid WARCs 8 months ago
website-extract-social-media Add support for Facebook /pages/category/Category/Name-ID URLs 4 years ago
wget-spider-estimate-size First set of little things 5 years ago
wiki-list-to-main Add ArchiveBot wiki list helper 4 years ago
wiki-recursive-extract-normalise Fix deduplication within each section processing 4 years ago
wiki-sections-sort Add wiki-sections-sort 4 years ago
wiki-website-extract-social-media Add script for automatic social media discovery 4 years ago
wpull1-parallel-progress-monitor First set of little things 5 years ago
wpull1-progress-monitor First set of little things 5 years ago
wpull2-extract-ignored Remove filtering of onsite URLs because it's unreliable 3 months ago
wpull2-extract-remaining Clean up wpull DB commands 3 years ago
wpull2-log-colourise Add wpull2-log-colourise 1 year ago
wpull2-log-extract-errors Treat NXDOMAIN and no A/AAAA record errors as ok 3 years ago
wpull2-requeue Error on unknown options 1 year ago
wpull2-url-origin Clean up wpull DB commands 3 years ago
youtube-channel-list.py Use _type instead of key check hack 1 year ago
youtube-extract Exclude backslashes in channel patterns 1 year ago
youtube-extract-rapid Add youtube-extract-rapid 2 years ago
youtube-extract-rapid.c Add youtube-extract-rapid 2 years ago
youtube-filter-autogen-channels Add youtube-filter-autogen-channels 4 years ago
zstdwarccat Fix 'Dictionary mismatch' error when very small dicts are used because the temporary file isn't written to disk before zstdcat gets executed 2 years ago

README.md

Over the past few years, I’ve written and accumulated a number of useful little things to help with archival-related tasks. This repository collects them. I hope someone finds some of them useful.

License (applies to all programs in this repository)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.