JustAnotherArchivist
2e1dc59e9d
Fix log level of one message
3 years ago
JustAnotherArchivist
f025c4e9f3
Add extensive debug logging
3 years ago
JustAnotherArchivist
ce7f8fdc92
Make optional arguments to fetch kwarg-only
3 years ago
JustAnotherArchivist
3c8b45b3a6
Refactor cleanup code
- Run the cleanup code on exceptions (e.g. ^C). There were several effects of that not happening previously; most notably, the log file was not written to the meta WARC.
- Cancel remaining tasks, which avoids a pile of asyncio warnings and errors on crashes.
- Close the DB before the WARC, or rather, close the WARC last. This is mostly a semantic change to further ensure that the log written to the meta WARC is as complete as possible.
3 years ago
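The cleanup order described in commit 3c8b45b3a6 can be sketched roughly as follows. This is a minimal illustration, not qwarc's actual code: the `run` and `worker` functions and the `events` list are hypothetical stand-ins, and the "db closed" / "warc closed" entries merely mark where the real close calls would go.

```python
import asyncio


async def run(events):
    """Hypothetical run loop; `events` records the order of cleanup steps."""

    async def worker():
        await asyncio.sleep(3600)  # stands in for a long-running item

    tasks = [asyncio.ensure_future(worker()) for _ in range(3)]
    try:
        # Simulate a crash (or ^C) in the main loop.
        raise RuntimeError('crash')
    finally:
        # Cancel whatever is still running so asyncio does not complain
        # about pending tasks being destroyed.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        events.append('tasks cancelled')
        # Close the DB first and the WARC last, so the log that ends up
        # in the meta WARC is as complete as possible.
        events.append('db closed')
        events.append('warc closed')


events = []
try:
    asyncio.run(run(events))
except RuntimeError:
    pass  # the original exception still propagates after cleanup
```

The `finally` block runs on both normal exit and exceptions, which is the property the commit message is after.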
JustAnotherArchivist
dcd5455388
Fix crash on starting a run while the DB is locked
3 years ago
JustAnotherArchivist
168fa78736
Avoid locking the DB when there are no subitems to insert
3 years ago
JustAnotherArchivist
4484d6c588
Add Item representation
3 years ago
JustAnotherArchivist
5675118877
Rename id to id_ to avoid clash with builtin
3 years ago
JustAnotherArchivist
a1e693739e
Replace DB locking with an async context manager
3 years ago
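One way DB locking can be wrapped in an async context manager, as commit a1e693739e describes, is sketched below. The commit message does not show the implementation, so everything here is an assumption: the helper name `exclusive_db_lock`, the use of SQLite's `BEGIN EXCLUSIVE`, the retry delay, and the `items` table are all hypothetical.

```python
import asyncio
import contextlib
import sqlite3


@contextlib.asynccontextmanager
async def exclusive_db_lock(db, retry_delay=0.1):
    """Hypothetical helper: hold an exclusive SQLite lock for the block."""
    while True:
        try:
            db.execute('BEGIN EXCLUSIVE')
            break
        except sqlite3.OperationalError:
            # Assumed to mean "database is locked"; yield to other tasks
            # and retry instead of crashing. A real implementation would
            # likely inspect the error before retrying.
            await asyncio.sleep(retry_delay)
    try:
        yield db.cursor()
        db.execute('COMMIT')
    except BaseException:
        db.execute('ROLLBACK')
        raise


async def main():
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/COMMIT above control the transaction; timeout=0 makes
    # a locked DB raise immediately rather than block the event loop.
    db = sqlite3.connect(':memory:', isolation_level=None, timeout=0)
    db.execute('CREATE TABLE items (id INTEGER PRIMARY KEY, status INTEGER)')
    async with exclusive_db_lock(db) as cursor:
        cursor.execute('INSERT INTO items (status) VALUES (0)')
    count = db.execute('SELECT COUNT(*) FROM items').fetchone()[0]
    db.close()
    return count


count = asyncio.run(main())
```

Compared with manual lock/unlock calls, the `async with` form guarantees the lock is released on every exit path.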
JustAnotherArchivist
15203bd991
Handle redirect traps/loops
3 years ago
JustAnotherArchivist
f8f5258197
Track redirect depth
3 years ago
JustAnotherArchivist
a3d6fb35f8
Turn response handlers into kwarg-only functions for easier extendability without breaking existing code
3 years ago
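The extendability argument in commit a3d6fb35f8 can be illustrated with a small sketch. The handler name, its parameters, and its return values are hypothetical and not qwarc's real signature; the point is only that a keyword-only handler accepting `**kwargs` keeps working when the caller later passes additional arguments.

```python
def handle_response(*, url, attempt, status, **kwargs):
    """Hypothetical handler: keyword-only, ignores arguments it does not know."""
    if status == 200:
        return 'success'
    if attempt >= 3:
        return 'give up'
    return 'retry'


# A caller that passes only the original arguments.
old_style = handle_response(url='https://example.org/', attempt=1, status=200)

# A newer caller passing extra (hypothetical) arguments; the handler above
# needs no changes to accept them.
new_style = handle_response(
    url='https://example.org/', attempt=1, status=500,
    item='example-item', redirect_depth=0,
)
```

Positional calls such as `handle_response('https://example.org/', 1, 200)` raise `TypeError`, so argument order can never be relied upon by accident.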
JustAnotherArchivist
6cc4adb901
Remove stray TODO
The DB creation operates with a DB lock, so that code can't run while another process is filling the DB; it would block on obtaining the lock a few lines prior instead.
3 years ago
JustAnotherArchivist
c5604ef965
Simplify header merging
3 years ago
JustAnotherArchivist
59ae1183d2
Add fromResponse parameter for URL completion and automatic Referer header
3 years ago
JustAnotherArchivist
2324216016
Add baseUrl and evaluate incomplete URLs relative to it
3 years ago
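Evaluating incomplete URLs relative to a base URL, as in commit 2324216016, follows the usual relative-reference resolution rules. The sketch below uses the standard library's `urllib.parse.urljoin` purely for illustration; the commit message does not say which URL library is actually used, and the example URLs are made up.

```python
from urllib.parse import urljoin

base_url = 'https://example.org/forum/'  # hypothetical baseUrl

resolved = {
    # Relative path: appended to the base path.
    'thread/1': urljoin(base_url, 'thread/1'),
    # Absolute path: replaces the base path, keeps scheme and host.
    '/about': urljoin(base_url, '/about'),
    # Scheme-relative URL: keeps only the scheme.
    '//cdn.example.org/a.js': urljoin(base_url, '//cdn.example.org/a.js'),
    # Complete URL: returned unchanged.
    'https://example.com/x': urljoin(base_url, 'https://example.com/x'),
}
```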
JustAnotherArchivist
b30ccf8bf8
Move response/exception history to ClientResponse.qhistory
It is rarely necessary to access the history, and the tuple return value clutters the spec file code.
As a consequence, it's no longer possible to return None if an error occurred without losing the history.
To replace that, this also introduces a DummyClientResponse, which is kind of ClientResponse-like, has the same qhistory attribute, and evaluates to False when cast to bool (such that the intuitive `if response` works as expected).
3 years ago
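The falsy placeholder response described in commit b30ccf8bf8 could look roughly like this. It is a minimal sketch based only on the commit message: the class name and the `qhistory` attribute come from the message, while the constructor signature, the `describe` helper, and the history contents are assumptions.

```python
class DummyClientResponse:
    """Stand-in returned when a fetch failed; falsy, but carries the history."""

    def __init__(self, qhistory=()):
        # Assumed to be a sequence of earlier responses/exceptions.
        self.qhistory = qhistory

    def __bool__(self):
        # Makes `if response:` behave as if None had been returned.
        return False


def describe(response):
    """Hypothetical spec-file code using the intuitive truthiness check."""
    if response:
        return 'ok'
    return f'failed after {len(response.qhistory)} attempt(s)'


failed = DummyClientResponse(qhistory=('attempt 1 error', 'attempt 2 error'))
message = describe(failed)
```

Unlike returning `None`, the caller can still inspect `response.qhistory` after a failure.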
JustAnotherArchivist
e69527c715
Add defaultResponseHandler on the Item level
3 years ago
JustAnotherArchivist
03336e4988
Add item to response handler arguments (e.g. for logging)
3 years ago
JustAnotherArchivist
6bdcfe71f0
Refactor database creation and item generation: call `Item.generate()` on every qwarc run and dedupe its output, allowing the addition of further items by modifying the spec file
3 years ago
JustAnotherArchivist
c878241f24
Switch from concurrent.futures.CancelledError to asyncio.CancelledError
Since Python 3.8, the latter does not inherit from the former anymore.
3 years ago
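For the switch in commit c878241f24, the practical consequence is that code awaiting a cancelled task should catch `asyncio.CancelledError` explicitly. A minimal sketch (the `main` coroutine and its return strings are illustrative only):

```python
import asyncio


async def main():
    task = asyncio.ensure_future(asyncio.sleep(3600))
    await asyncio.sleep(0)  # let the task start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        # Catching concurrent.futures.CancelledError here would miss the
        # cancellation on Python 3.8 and later.
        return 'cancelled'
    return 'finished'


outcome = asyncio.run(main())
```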
JustAnotherArchivist
749158b97a
Use the Future's result directly rather than awaiting again
The asyncio documentation does not specify whether awaiting a Future multiple times is supported or not: https://bugs.python.org/issue41275
3 years ago
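The pattern from commit 749158b97a, reading a finished Future's result instead of awaiting it a second time, might look like the following sketch. The `work` coroutine and the loop structure are hypothetical; only the use of `Future.result()` on futures returned as done by `asyncio.wait` reflects the commit message.

```python
import asyncio


async def work(n):
    await asyncio.sleep(0)
    return n * 2


async def main():
    tasks = {asyncio.ensure_future(work(n)) for n in range(3)}
    results = []
    while tasks:
        # asyncio.wait already awaited these futures; `done` holds finished ones
        # and the still-pending ones are carried over to the next iteration.
        done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for future in done:
            # Read the stored result directly rather than `await future` again.
            results.append(future.result())
    return sorted(results)


results = asyncio.run(main())
```

`future.result()` re-raises the task's exception if it failed, so error handling stays the same as with a second `await`.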
JustAnotherArchivist
a85e80ffa2
Configurable request timeout
3 years ago
JustAnotherArchivist
429ac94689
Make it possible to override and remove headers
3 years ago
JustAnotherArchivist
e40be54578
Document verify_ssl parameter
3 years ago
JustAnotherArchivist
d3437bde19
Move default headers to qwarc.const
3 years ago
JustAnotherArchivist
1678075a89
Log traceback on exceptions raised from an item
4 years ago
JustAnotherArchivist
b1a1c03f7e
Handle STOP file and high memory usage before full disk to allow stopping while the disk is above the limit
4 years ago
JustAnotherArchivist
dd44d9b174
Adjust logging levels: log individual request failures only at WARNING and cancelled tasks at ERROR level
4 years ago
JustAnotherArchivist
91035d769c
Catch exceptions in Item.process and mark the items as errors instead of crashing
4 years ago
JustAnotherArchivist
69984765b3
Fix taskType typo silencing cancellation warnings
4 years ago
JustAnotherArchivist
c263ad0b03
Return ClientResponse object from fetch only if the retrieval was successful
If an exception was raised and caught, the object is still present in the history.
4 years ago
JustAnotherArchivist
cb0d11284e
Write only successful retrievals (i.e. ones that don't cause an exception) to WARC
4 years ago
JustAnotherArchivist
1214409a0b
Flush big responses to a temporary file instead of trying to keep everything in-memory
4 years ago
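Spilling large response bodies to disk, as in commit 1214409a0b, can be approximated with the standard library's `tempfile.SpooledTemporaryFile`. This is only one possible approach and not necessarily what the commit does; the `buffer_body` helper and the 1024-byte threshold are hypothetical.

```python
import tempfile


def buffer_body(chunks, max_in_memory=1024):
    """Hypothetical helper: buffer chunks in memory, spill to disk when large."""
    # Data stays in memory until more than max_in_memory bytes have been
    # written, then it is moved to a real temporary file transparently.
    body = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        body.write(chunk)
    body.seek(0)  # rewind so the caller can read from the start
    return body


small = buffer_body([b'a' * 10])                # stays in memory
large = buffer_body([b'a' * 600, b'b' * 600])   # exceeds the limit, spills to disk

small_data = small.read()
large_data = large.read()
small.close()
large.close()
```

Callers read from the returned object the same way in both cases, so the memory limit is invisible to the rest of the code.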
JustAnotherArchivist
08d96b37c5
Support deep/multiple inheritance from Item
4 years ago
JustAnotherArchivist
9d8de13775
Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed
This also renames add_item to add_subitem for clarity.
4 years ago
JustAnotherArchivist
50b936b18c
Refactor QWARC class to keep relevant variables in instance attributes instead of local variables
4 years ago
JustAnotherArchivist
c5d8d93166
Remove stray whitespace
4 years ago
JustAnotherArchivist
7d53577522
Add parameter for disabling SSL/TLS certificate validation
4 years ago
JustAnotherArchivist
50d46ad51c
Use log filename in the target URI of the log resource record
4 years ago
JustAnotherArchivist
a5dfd5c805
Write spec file + its dependencies and command line to meta WARC
4 years ago
JustAnotherArchivist
d751844626
Fix starting another item before stopping on STOP file or memory limit exceedance
4 years ago
JustAnotherArchivist
2b0778f9b5
Remove leftovers from initial code rewrite
4 years ago
JustAnotherArchivist
ab22966fef
Add to log which item a message is coming from
4 years ago
JustAnotherArchivist
6fafd32685
Error when the retries are exceeded
4 years ago
JustAnotherArchivist
8647d6b396
Use f-strings instead of str.format
4 years ago
JustAnotherArchivist
5008e6e8cd
Deduplicate items
4 years ago
JustAnotherArchivist
46c95e2157
Disable decoding the response content
chardet can be very slow (https://github.com/chardet/chardet/issues/29, https://github.com/psf/requests/issues/2359) and the decoding may be unnecessary if it's binary content.
4 years ago
JustAnotherArchivist
ad22a2327a
Support adding headers to individual requests
5 years ago
JustAnotherArchivist
67076f964c
Add support for POST requests
5 years ago