The little things give you away... A collection of various small helper stuff
JustAnotherArchivist c50a8fd796 Fix 'Dictionary mismatch' error when very small dicts are used because the temporary file isn't written to disk before zstdcat gets executed 2 years ago
LICENSE Initial commit 5 years ago
README.md Initial commit 5 years ago
alphabetseq Swap syntaxes 2 years ago
archivebot-blogspot Fix HTTPS handling 5 years ago
archivebot-high-memory Support python3 in any directory instead of just /usr/bin 4 years ago
archivebot-irccloud-paste Add archivebot-irccloud-paste 3 years ago
archivebot-jobid-calculation More snscrape helper tools 5 years ago
archivebot-jobs Pass through datetime, math, re, and time to --pyfilter 3 years ago
archivebot-list-stuck-requests Fix line endings 5 years ago
archivebot-log-extract-ignores Add archivebot-log-extract-ignores 3 years ago
archivebot-monitor-job-queue First set of little things 5 years ago
archivebot-youtube Add helper for AB/chromebot-ing YouTube channels and users 5 years ago
azure-storage-list Add --jsonl option 2 years ago
b64grep Add b64grep 2 years ago
bing-scrape Add Bing, Reddit/Pushshift, and FoolFuuka scrapers 5 years ago
bugzilla-url-list Add Bugzilla URL list generator 2 years ago
combine-by-prefix Add combine-by-prefix 2 years ago
curl-ua Add IE6 UA 3 years ago
deb-repo-urls Fix deb file URLs 3 years ago
dedupe Another alternative and performance/memory comparison 3 years ago
europarl-meps-collect Add script for scraping MEP links from europarl.europa.eu 5 years ago
foolfuuka-search Better workaround for the 5000 results limit; works for FoolFuuka 2.0.1 and up 5 years ago
format-size Split out size formatting 5 years ago
fos-ftp-upload First set of little things 5 years ago
get-crx4chrome-urls First set of little things 5 years ago
github-list-repos Fix org repo listing on new design/site structure 2 years ago
gitlab-list-repos Add support for other instances and full-instance listing 2 years ago
gofile.io-dl Add support for password-protected folders 2 years ago
ia-cdx-search Fix crash on an empty response 2 years ago
ia-derive Add script to queue derive on IA 5 years ago
ia-files-xml-to-jsonl Guarantee stable output order 3 years ago
ia-upload-progress Proper script for tracking size of uploaded data 5 years ago
ia-verify-file Add a timeout to prevent potentially indefinite blocking 2 years ago
ia-wait-item-tasks Add ia-wait-item-tasks 2 years ago
iasha1check Colourise sha1sum output 3 years ago
ix.io-upload Allow overriding the "remote filename" 5 years ago
kill-wpull-connections Merge kill-wpull-connections repository into little-things 3 years ago
killcx-all-https First set of little things 5 years ago
mastodon-enumerate-users Enumerate users on a Mastodon instance 5 years ago
mastodon-outdated Finding outdated Mastodon instances 5 years ago
parent-urls Refactor, strip query/fragment 3 years ago
pipelines-launch-in-tmux-windows First set of little things 5 years ago
pipelines-monitor-tmux-wget-outcomes Monitor how a pipeline's wget processes are faring 5 years ago
pipelines-stop-gracefully First set of little things 5 years ago
reddit-pushshift-search Add Bing, Reddit/Pushshift, and FoolFuuka scrapers 5 years ago
run-every-five-minutes First set of little things 5 years ago
s3-bucket-list Ignore TLS issues 3 years ago
s3-bucket-list-qwarc Record wrapper script in meta WARC as well 3 years ago
snscrape-extract Add support for Twitter hashtag extraction 4 years ago
snscrape-facebook-user Silence by default 5 years ago
snscrape-instagram-user Silence by default 5 years ago
snscrape-prepare-commands Add support for Twitter hashtag extraction 4 years ago
snscrape-tmux Update tmux session commands 4 years ago
snscrape-twitter-filter Filter Twitter hashtag scrapes based on account scrapes 5 years ago
snscrape-twitter-hashtag Extract external links from Twitter 5 years ago
snscrape-twitter-user Extract external links from Twitter 5 years ago
snscrape-upload Print Instagram ignore immediately after upload instead of at the end 5 years ago
snscrape-vk-user Silence by default 5 years ago
snscrape-wiki-transfer-merge Helper tools for snscrape and the wiki pages 5 years ago
social-media-extract-profile-link Fix decoding of links on Facebook profiles 4 years ago
sum-sizes Add sum-sizes 2 years ago
tar-many-files-progress First set of little things 5 years ago
tcp-closer Add tcp-closer command 5 years ago
transfer.archivete.am-upload Handle HTTP/2 lowercase headers 3 years ago
transfer.notkiska.pw-check-ia Switch to HTTPS 3 years ago
uniqify Add uniqify 5 years ago
url-normalise Normalise domain name to lower-case before further processing 4 years ago
warc-peek Add WARC/1.1 support 3 years ago
warc-size Split out size formatting 5 years ago
warc-tiny Fix compatibility with wpull 2.x 3 years ago
website-extract-social-media Add support for Facebook /pages/category/Category/Name-ID URLs 4 years ago
wget-spider-estimate-size First set of little things 5 years ago
wiki-list-to-main Add ArchiveBot wiki list helper 5 years ago
wiki-recursive-extract-normalise Fix deduplication within each section processing 4 years ago
wiki-sections-sort Add wiki-sections-sort 4 years ago
wiki-website-extract-social-media Add script for automatic social media discovery 4 years ago
wpull1-parallel-progress-monitor First set of little things 5 years ago
wpull1-progress-monitor First set of little things 5 years ago
wpull2-extract-remaining Clean up wpull DB commands 3 years ago
wpull2-log-extract-errors Treat NXDOMAIN and no A/AAAA record errors as ok 3 years ago
wpull2-requeue Print number of modified records on requeueing 2 years ago
wpull2-url-origin Clean up wpull DB commands 3 years ago
youtube-channel-list.py Add YouTube channel listing script 2 years ago
youtube-extract Handle ancient /?v= URLs 2 years ago
youtube-filter-autogen-channels Add youtube-filter-autogen-channels 4 years ago
zstdwarccat Fix 'Dictionary mismatch' error when very small dicts are used because the temporary file isn't written to disk before zstdcat gets executed 2 years ago

README.md

Over the past few years, I’ve written and accumulated a number of useful little things to help with archival-related tasks. This repository collects them. I hope someone finds some of them useful.

License (applies to all programs in this repository)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.