JustAnotherArchivist
6cc4adb901
Remove stray TODO
The DB creation operates with a DB lock, so that code can't run while another process is filling the DB; it would block on obtaining the lock a few lines prior instead.
4 years ago
JustAnotherArchivist
c5604ef965
Simplify header merging
4 years ago
JustAnotherArchivist
59ae1183d2
Add fromResponse parameter for URL completion and automatic Referer header
4 years ago
JustAnotherArchivist
2324216016
Add baseUrl and evaluate incomplete URLs relative to it
4 years ago
JustAnotherArchivist
b30ccf8bf8
Move response/exception history to ClientResponse.qhistory
It is rarely necessary to access the history, and the tuple return value clutters the spec file code.
As a consequence, it's no longer possible to return None if an error occurred without losing the history.
To replace that, this also introduces a DummyClientResponse, which is kind of ClientResponse-like, has the same qhistory attribute, and evaluates to False when cast to bool (such that the intuitive `if response` works as expected).
4 years ago
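The falsy placeholder described in this commit can be illustrated with a minimal sketch. It is not qwarc's actual class: only the names `DummyClientResponse` and `qhistory` come from the commit message, while the constructor signature, the history entries, and the `describe` helper are assumptions made for the example.

```python
class DummyClientResponse:
    """Hypothetical stand-in for a response when every attempt failed."""

    def __init__(self, qhistory=()):
        # Attribute name taken from the commit message; the contents are illustrative.
        self.qhistory = tuple(qhistory)

    def __bool__(self):
        # Always falsy, so `if response:` skips failed retrievals.
        return False


def describe(response):
    # Hypothetical helper showing the intended calling pattern.
    if response:
        return "ok"
    return f"failed after {len(response.qhistory)} attempt(s)"


failed = DummyClientResponse(qhistory=[("GET http://example.org/", "TimeoutError")])
print(describe(failed))
```

The caller keeps the simple `if response` check and can still inspect `response.qhistory` after a failure, which returning `None` would not allow.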
JustAnotherArchivist
e69527c715
Add defaultResponseHandler on the Item level
4 years ago
JustAnotherArchivist
03336e4988
Add item to response handler arguments (e.g. for logging)
4 years ago
JustAnotherArchivist
6bdcfe71f0
Refactor database creation and item generation: call `Item.generate()` on every qwarc run and dedupe its output, allowing the addition of further items by modifying the spec file
4 years ago
JustAnotherArchivist
c878241f24
Switch from concurrent.futures.CancelledError to asyncio.CancelledError
Since Python 3.8, the latter does not inherit from the former anymore.
4 years ago
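As a rough illustration of why the exception class matters, the sketch below cancels a task and catches `asyncio.CancelledError`. The `worker` and `main` coroutines are made up for the example and are not taken from qwarc.

```python
import asyncio


async def worker():
    # Stand-in for a long-running item; it sleeps until it is cancelled.
    await asyncio.sleep(3600)


async def main():
    task = asyncio.ensure_future(worker())
    await asyncio.sleep(0)  # give the worker a chance to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        # On Python 3.8+, catching concurrent.futures.CancelledError here
        # would not match, because the asyncio class no longer derives from it.
        return "cancelled"
    return "finished"


result = asyncio.run(main())
print(result)
```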
JustAnotherArchivist
749158b97a
Use the Future's result directly rather than awaiting again
The asyncio documentation does not specify whether awaiting a Future multiple times is supported or not: https://bugs.python.org/issue41275
4 years ago
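A minimal sketch of the pattern, assuming the futures are collected with `asyncio.wait` (the `fetch` coroutine and its inputs are hypothetical): once a future is in the `done` set, its value is read with `.result()` instead of awaiting it a second time.

```python
import asyncio


async def fetch(n):
    # Hypothetical unit of work standing in for a real request.
    await asyncio.sleep(0)
    return n * 2


async def main():
    tasks = {asyncio.ensure_future(fetch(n)) for n in (1, 2, 3)}
    done, pending = await asyncio.wait(tasks)
    # The futures in `done` are already complete, so read their results
    # directly rather than awaiting each one again.
    return sorted(task.result() for task in done)


results = asyncio.run(main())
print(results)
```

Note that `.result()` re-raises the future's exception if it failed, so error handling stays in one place.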
JustAnotherArchivist
a85e80ffa2
Configurable request timeout
4 years ago
JustAnotherArchivist
429ac94689
Make it possible to override and remove headers
4 years ago
JustAnotherArchivist
e40be54578
Document verify_ssl parameter
4 years ago
JustAnotherArchivist
d3437bde19
Move default headers to qwarc.const
4 years ago
JustAnotherArchivist
1678075a89
Log traceback on exceptions raised from an item
4 years ago
JustAnotherArchivist
b1a1c03f7e
Handle STOP file and high memory usage before full disk to allow stopping while the disk is above the limit
4 years ago
JustAnotherArchivist
dd44d9b174
Adjust logging levels: log individual request failures only at WARNING and cancelled tasks at ERROR level
4 years ago
JustAnotherArchivist
91035d769c
Catch exceptions in Item.process and mark the items as errors instead of crashing
4 years ago
JustAnotherArchivist
69984765b3
Fix taskType typo silencing cancellation warnings
4 years ago
JustAnotherArchivist
c263ad0b03
Return ClientResponse object from fetch only if the retrieval was successful
If an exception was raised and caught, the object is still present in the history.
4 years ago
JustAnotherArchivist
cb0d11284e
Write only successful retrievals (i.e. ones that don't cause an exception) to WARC
4 years ago
JustAnotherArchivist
1214409a0b
Flush big responses to a temporary file instead of trying to keep everything in-memory
4 years ago
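One way to get this behaviour with the standard library is `tempfile.SpooledTemporaryFile`, which keeps data in memory up to a size limit and then rolls over to a file on disk. Whether qwarc uses this class is an assumption; the `buffer_body` helper and the 1 KiB threshold are purely illustrative.

```python
import tempfile


def buffer_body(chunks, max_memory=1024):
    """Collect body chunks, spilling to a temporary file past max_memory bytes."""
    # Hypothetical helper: stays in memory until max_memory is exceeded,
    # then transparently moves the data to a temporary file on disk.
    buf = tempfile.SpooledTemporaryFile(max_size=max_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)
    return buf


small = buffer_body([b"a" * 10])              # fits in memory
big = buffer_body([b"b" * 800, b"b" * 800])   # exceeds the limit, rolls over to disk
small_data = small.read()
big_data = big.read()
print(len(small_data), len(big_data))
```

The caller reads from the returned object the same way in both cases, so the code that writes the WARC record does not need to know where the bytes are stored.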
JustAnotherArchivist
08d96b37c5
Support deep/multiple inheritance from Item
4 years ago
JustAnotherArchivist
9d8de13775
Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed
This also renames add_item to add_subitem for clarity.
4 years ago
JustAnotherArchivist
50b936b18c
Refactor QWARC class to keep relevant variables in instance attributes instead of local variables
4 years ago
JustAnotherArchivist
c5d8d93166
Remove stray whitespace
4 years ago
JustAnotherArchivist
7d53577522
Add parameter for disabling SSL/TLS certificate validation
4 years ago
JustAnotherArchivist
50d46ad51c
Use log filename in the target URI of the log resource record
4 years ago
JustAnotherArchivist
a5dfd5c805
Write spec file + its dependencies and command line to meta WARC
4 years ago
JustAnotherArchivist
d751844626
Fix starting another item before stopping on STOP file or memory limit exceedance
5 years ago
JustAnotherArchivist
2b0778f9b5
Remove leftovers from initial code rewrite
5 years ago
JustAnotherArchivist
ab22966fef
Add to log which item a message is coming from
5 years ago
JustAnotherArchivist
6fafd32685
Error when the retries are exceeded
5 years ago
JustAnotherArchivist
8647d6b396
Use f-strings instead of str.format
5 years ago
JustAnotherArchivist
5008e6e8cd
Deduplicate items
5 years ago
JustAnotherArchivist
46c95e2157
Disable decoding the response content
chardet can be very slow (https://github.com/chardet/chardet/issues/29 https://github.com/psf/requests/issues/2359 ) and the decoding may be unnecessary if it's binary content.
5 years ago
JustAnotherArchivist
ad22a2327a
Support adding headers to individual requests
5 years ago
JustAnotherArchivist
67076f964c
Add support for POST requests
5 years ago
JustAnotherArchivist
c1574a06c9
Fix sleep task type
5 years ago
JustAnotherArchivist
e0ca88c807
Fix reference to get_rss
5 years ago
JustAnotherArchivist
8a8935810d
Fix references to memory and disk space check methods
5 years ago
JustAnotherArchivist
be5673cfbf
Add record deduplication within a process
5 years ago
JustAnotherArchivist
e892a6b6a7
Initial commit
5 years ago