JustAnotherArchivist
b59b82041c
Add support for wiki list entries with options
4 years ago
JustAnotherArchivist
d5953ca95c
Use old Opera UA for Twitter to force the old design
4 years ago
JustAnotherArchivist
1fa57d41a3
Fix extraction on Wix sites from JSON inside a data attribute
Example: https://www.martinedocourt.ch/
4 years ago
JustAnotherArchivist
4a742162d0
Suppress output if there are no matched jobs
4 years ago
JustAnotherArchivist
fe72d57d7e
Add filtering based on substrings anywhere in the string and on regex
4 years ago
JustAnotherArchivist
cf30a53f82
Add case-insensitive filtering
4 years ago
JustAnotherArchivist
711e444e8e
Highlight jobs that have been inactive for over 6 hours
4 years ago
JustAnotherArchivist
b2919030ab
Fix sorting on numerical columns
4 years ago
JustAnotherArchivist
257b578fbe
Add descending sort
4 years ago
JustAnotherArchivist
6e7449d137
Support column names in any capitalisation
4 years ago
JustAnotherArchivist
e5e7bdf8af
Add more filtering options
4 years ago
JustAnotherArchivist
c611420be9
Remove options from usage line
4 years ago
JustAnotherArchivist
824eb5e353
Add script for getting an AB job overview table
4 years ago
JustAnotherArchivist
34c1a58034
Fix detection of multiple transfer encodings
4 years ago
JustAnotherArchivist
195df08cd5
Fix marker loop on some filenames caused by missing HTML entity processing
E.g. https://audio-market-dev.s3.amazonaws.com/?marker=media/23/Hard%20Style%20Producer
4 years ago
JustAnotherArchivist
3cc3a1ed38
Fix nested tags
E.g. the <Owner> tag, which contains <ID> and <DisplayName>; see https://appengage-video.s3.amazonaws.com/
4 years ago
JustAnotherArchivist
5c907488e1
Handle broken pipe on stdout
4 years ago
JustAnotherArchivist
b38349e91f
Fix duplicate slashes
4 years ago
JustAnotherArchivist
f23e4cc71e
Retry on internal errors
4 years ago
JustAnotherArchivist
bfe5f59e25
Add marker loop detection
4 years ago
JustAnotherArchivist
66bdef3247
Take a bucket URL argument instead of hostname + bucketname
4 years ago
JustAnotherArchivist
e385c1d302
Limit curl to 10 seconds
4 years ago
JustAnotherArchivist
74162445aa
Replace curl-archivebot-ua with a more general curl-ua script that supports different UAs selected by aliases
4 years ago
JustAnotherArchivist
9d712d64d7
Ignore certain URLs on Twitter and Instagram entirely
4 years ago
JustAnotherArchivist
87826d4844
Use line variable instead of prefix+url
4 years ago
JustAnotherArchivist
163aacf13c
Print deletion URL on stderr
4 years ago
JustAnotherArchivist
486a593f15
Add support for more weird Facebook URLs
4 years ago
JustAnotherArchivist
256a94443e
Fix deduplication within each section processing
4 years ago
JustAnotherArchivist
98d77ecc96
Deduplicate output
This uses mawk's extensions `-W interactive` and `delete array`; it will probably work with certain other AWK implementations as well, but for now it depends on mawk explicitly.
4 years ago
JustAnotherArchivist
6ce64baf87
Remove redundant url-normalise after the extraction
Since all input is run through url-normalise before processing and all output of website and social media extraction is also normalised, it's not necessary to normalise again at the end.
4 years ago
JustAnotherArchivist
318183148e
Fix URL extraction from Facebook profile overview pages
4 years ago
JustAnotherArchivist
869ade27eb
Separate names in stderr annotations for the various url-normalise processes
4 years ago
JustAnotherArchivist
79f0bd4332
Normalise URLs everywhere to reduce duplicates
4 years ago
JustAnotherArchivist
dc4efcfbfb
One URL normalisation script to rule them all
Consolidate social media profile, YouTube, and (new) generic web page URL normalisation into one script
4 years ago
JustAnotherArchivist
0f13a1fadd
Add verbosity options, and annotate stderr on wiki-recursive-extract
4 years ago
JustAnotherArchivist
3ec816cd04
Add script for link extraction from social media profiles
4 years ago
JustAnotherArchivist
5285c406d9
Add script for recursive website and social media discovery
4 years ago
JustAnotherArchivist
2be9ca922e
Ignore more useless Facebook links
4 years ago
JustAnotherArchivist
c3b0e5543e
Add support for facebook.com/pg/something
4 years ago
JustAnotherArchivist
7c389f1fef
Add support for hashbang fragments on Twitter links
4 years ago
JustAnotherArchivist
c56736bc4a
Ignore /intent on Twitter
4 years ago
JustAnotherArchivist
4f34753788
Add support for Instagram posts and ignore spurious links from the CDN
4 years ago
JustAnotherArchivist
ad030f5d21
Add support for Facebook pages and groups
4 years ago
JustAnotherArchivist
cd0b3f6214
Ignore /vi/* on YouTube (video thumbnails)
4 years ago
JustAnotherArchivist
6f1cca73ad
Support hashtags
4 years ago
JustAnotherArchivist
c61efa03f0
Make social media normalisation script snscrape-independent
4 years ago
JustAnotherArchivist
e6008eb971
Add script for automatic social media discovery
4 years ago
JustAnotherArchivist
fed66542fa
Support python3 in any directory instead of just /usr/bin
4 years ago
JustAnotherArchivist
5982e131a4
Stop gracefully when encountering a SIGPIPE
4 years ago
JustAnotherArchivist
c13a1150df
Add support for WARC/1.1
4 years ago