JustAnotherArchivist
3cc3a1ed38
Fix nested tags
E.g. the <Owner> tag, which contains <ID> and <DisplayName>; seen on https://appengage-video.s3.amazonaws.com/
4 years ago
JustAnotherArchivist
5c907488e1
Handle broken pipe on stdout
4 years ago
JustAnotherArchivist
b38349e91f
Fix duplicate slashes
4 years ago
JustAnotherArchivist
f23e4cc71e
Retry on internal errors
4 years ago
JustAnotherArchivist
bfe5f59e25
Add marker loop detection
4 years ago
JustAnotherArchivist
66bdef3247
Take a bucket URL argument instead of hostname + bucketname
4 years ago
JustAnotherArchivist
e385c1d302
Limit curl to 10 seconds
4 years ago
JustAnotherArchivist
74162445aa
Replace curl-archivebot-ua with a more general curl-ua script that supports different UAs selected by aliases
4 years ago
JustAnotherArchivist
9d712d64d7
Ignore certain URLs on Twitter and Instagram entirely
4 years ago
JustAnotherArchivist
87826d4844
Use line variable instead of prefix+url
4 years ago
JustAnotherArchivist
163aacf13c
Print deletion URL on stderr
4 years ago
JustAnotherArchivist
486a593f15
Add support for more weird Facebook URLs
4 years ago
JustAnotherArchivist
256a94443e
Fix deduplication within each section processing
4 years ago
JustAnotherArchivist
98d77ecc96
Deduplicate output
This uses mawk's extensions `-W interactive` and `delete array`; it will probably work with certain other AWK implementations as well, but for now it depends on mawk explicitly.
4 years ago
JustAnotherArchivist
6ce64baf87
Remove redundant url-normalise after the extraction
Since all input is run through url-normalise before processing and all output of website and social media extraction is also normalised, it's not necessary to normalise again at the end.
4 years ago
JustAnotherArchivist
318183148e
Fix URL extraction from Facebook profile overview pages
4 years ago
JustAnotherArchivist
869ade27eb
Separate names in stderr annotations for the various url-normalise processes
4 years ago
JustAnotherArchivist
79f0bd4332
Normalise URLs everywhere to reduce duplicates
4 years ago
JustAnotherArchivist
dc4efcfbfb
One URL normalisation script to rule them all
Consolidate social media profile, YouTube, and (new) generic web page URL normalisation into one script
4 years ago
JustAnotherArchivist
0f13a1fadd
Add verbosity options, and annotate stderr on wiki-recursive-extract
4 years ago
JustAnotherArchivist
3ec816cd04
Add script for link extraction from social media profiles
4 years ago
JustAnotherArchivist
5285c406d9
Add script for recursive website and social media discovery
4 years ago
JustAnotherArchivist
2be9ca922e
Ignore more useless Facebook links
4 years ago
JustAnotherArchivist
c3b0e5543e
Add support for facebook.com/pg/something
4 years ago
JustAnotherArchivist
7c389f1fef
Add support for hashbang fragments on Twitter links
4 years ago
JustAnotherArchivist
c56736bc4a
Ignore /intent on Twitter
4 years ago
JustAnotherArchivist
4f34753788
Add support for Instagram posts and ignore spurious links from the CDN
4 years ago
JustAnotherArchivist
ad030f5d21
Add support for Facebook pages and groups
4 years ago
JustAnotherArchivist
cd0b3f6214
Ignore /vi/* on YouTube (video thumbnails)
4 years ago
JustAnotherArchivist
6f1cca73ad
Support hashtags
4 years ago
JustAnotherArchivist
c61efa03f0
Make social media normalisation script snscrape-independent
4 years ago
JustAnotherArchivist
e6008eb971
Add script for automatic social media discovery
4 years ago
JustAnotherArchivist
fed66542fa
Support python3 in any directory instead of just /usr/bin
4 years ago
JustAnotherArchivist
5982e131a4
Stop gracefully when encountering a SIGPIPE
4 years ago
JustAnotherArchivist
c13a1150df
Add support for WARC/1.1
4 years ago
JustAnotherArchivist
376cde7b8c
Fix broken block digest calculation on malformed HTTP responses
4 years ago
JustAnotherArchivist
b121cbd958
Write all log messages to stderr
4 years ago
JustAnotherArchivist
ed1270d988
Add support for upper-cased chunk lengths
4 years ago
JustAnotherArchivist
d4826abde2
Add record ID to log messages
4 years ago
JustAnotherArchivist
4925a912c0
Add youtube-filter-autogen-channels
4 years ago
JustAnotherArchivist
9b8f223776
Add wiki-sections-sort
4 years ago
JustAnotherArchivist
552a4147c2
Fix not returning complete body for non-chunked responses
Leftover from debugging
4 years ago
JustAnotherArchivist
0dc0de6b50
Add support for lists
4 years ago
JustAnotherArchivist
9d344df8c6
+x
4 years ago
JustAnotherArchivist
f6a7cbfc70
Fix --with-list-urls help message
4 years ago
JustAnotherArchivist
9743aa7c35
Add s3-bucket-list
4 years ago
JustAnotherArchivist
91adce786f
Add YouTube normalisation script
5 years ago
JustAnotherArchivist
5ca90c3b7d
Update tmux session commands
5 years ago
JustAnotherArchivist
679923d37d
Add support for Twitter hashtag extraction
5 years ago
JustAnotherArchivist
663383830c
Add support for lists
5 years ago