JustAnotherArchivist
a91cc23d47
Simplify get_software_info's signature to just the extra dependency packages
As a consequence, SpecDependencies.extra can now be any data type that can be put into JSON; unhashable types previously caused a crash due to the lru_cache.
4 years ago
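The crash mentioned in this commit follows from `functools.lru_cache` needing hashable arguments to build its cache key. A minimal sketch of that failure mode, using a hypothetical `cached_info` function (an assumption for illustration, not qwarc's actual `get_software_info`):

```python
import functools


@functools.lru_cache(maxsize=None)
def cached_info(extra):
    # Hypothetical stand-in; not qwarc's get_software_info.
    return {'extra': extra}


# A hashable argument (here a tuple) can serve as a cache key.
ok = cached_info(('aiohttp', 'warcio'))

# An unhashable argument (here a dict) fails before the function body runs.
try:
    cached_info({'package': 'aiohttp'})
except TypeError as e:
    error = str(e)
else:
    error = None
```

Keeping such values out of the cached function's arguments is one way to avoid the problem; how qwarc itself does this is not shown here.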
JustAnotherArchivist
6cc4adb901
Remove stray TODO
The DB creation operates with a DB lock, so that code can't run while another process is filling the DB; it would block on obtaining the lock a few lines prior instead.
4 years ago
JustAnotherArchivist
c5604ef965
Simplify header merging
4 years ago
JustAnotherArchivist
59ae1183d2
Add fromResponse parameter for URL completion and automatic Referer header
4 years ago
JustAnotherArchivist
2324216016
Add baseUrl and evaluate incomplete URLs relative to it
4 years ago
JustAnotherArchivist
b30ccf8bf8
Move response/exception history to ClientResponse.qhistory
It is rarely necessary to access the history, and the tuple return value clutters the spec file code.
As a consequence, it's no longer possible to return None if an error occurred without losing the history.
To replace that, this also introduces a DummyClientResponse, which is kind of ClientResponse-like, has the same qhistory attribute, and evaluates to False when cast to bool (such that the intuitive `if response` works as expected).
4 years ago
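The falsy placeholder described in this commit can be sketched as follows. This is an illustration under assumptions, not qwarc's code: only the names `DummyClientResponse` and `qhistory` come from the commit message, while the constructor signature and the `describe` helper are hypothetical.

```python
class DummyClientResponse:
    """Falsy placeholder that still carries the request history (illustrative only)."""

    def __init__(self, qhistory=()):
        # Assumed constructor; the real class may be built differently.
        self.qhistory = tuple(qhistory)

    def __bool__(self):
        # Makes `if response` take the failure branch.
        return False


def describe(response):
    # Hypothetical helper showing the `if response` idiom in spec file code.
    if response:
        return 'success'
    return 'failed after {} attempt(s)'.format(len(response.qhistory))


failed = DummyClientResponse(qhistory=['attempt 1: timeout', 'attempt 2: timeout'])
summary = describe(failed)
```

Because the object is falsy rather than `None`, the caller can still inspect `failed.qhistory` after the check.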
JustAnotherArchivist
e69527c715
Add defaultResponseHandler on the Item level
4 years ago
JustAnotherArchivist
03336e4988
Add item to response handler arguments (e.g. for logging)
4 years ago
JustAnotherArchivist
005999fcb9
Disable aiohttp's Content-Type checking on JSON parsing by default
4 years ago
JustAnotherArchivist
6bdcfe71f0
Refactor database creation and item generation: call `Item.generate()` on every qwarc run and dedupe its output, allowing the addition of further items by modifying the spec file
4 years ago
JustAnotherArchivist
c878241f24
Switch from concurrent.futures.CancelledError to asyncio.CancelledError
Since Python 3.8, the latter does not inherit from the former anymore.
4 years ago
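A small sketch of why the switch matters on Python 3.8 and later: awaiting a cancelled task raises `asyncio.CancelledError`, which an `except concurrent.futures.CancelledError` clause no longer catches. The `worker` and `main` coroutines are hypothetical and not taken from qwarc.

```python
import asyncio
import concurrent.futures


async def worker():
    # Hypothetical long-running task.
    await asyncio.sleep(3600)


async def main():
    task = asyncio.ensure_future(worker())
    await asyncio.sleep(0)  # give the task a chance to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return 'cancelled'
    return 'finished'


result = asyncio.run(main())

# On Python 3.8+, the two exception classes are unrelated.
related = issubclass(asyncio.CancelledError, concurrent.futures.CancelledError)
```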
JustAnotherArchivist
749158b97a
Use the Future's result directly rather than awaiting again
The asyncio documentation does not specify whether awaiting a Future multiple times is supported or not: https://bugs.python.org/issue41275
4 years ago
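The pattern named in this commit can be sketched as below: once `asyncio.wait` reports a future as done, its value is read with `result()` instead of awaiting the future a second time. The `fetch` coroutine is a hypothetical stand-in, not qwarc's code.

```python
import asyncio


async def fetch(n):
    # Hypothetical stand-in for an item-processing coroutine.
    await asyncio.sleep(0)
    return n * 2


async def main():
    tasks = {asyncio.ensure_future(fetch(n)) for n in range(3)}
    done, pending = await asyncio.wait(tasks)
    # Every future in `done` has finished, so result() returns immediately
    # (or re-raises the exception the task ended with).
    return sorted(future.result() for future in done)


results = asyncio.run(main())
```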
JustAnotherArchivist
a85e80ffa2
Configurable request timeout
4 years ago
JustAnotherArchivist
429ac94689
Make it possible to override and remove headers
4 years ago
JustAnotherArchivist
e40be54578
Document verify_ssl parameter
4 years ago
JustAnotherArchivist
d3437bde19
Move default headers to qwarc.const
4 years ago
JustAnotherArchivist
17fc3499ff
Fix infinite loop in workaround for aiohttp issue 4630
If a response ends with '0\r\n' or '0\r\n\r', ClientResponse._read loops forever trying to read 4 more bytes.
In addition, bump that read to 1 KiB for better worst-case performance.
4 years ago
JustAnotherArchivist
b6003af1e5
Work around aiohttp bug on parsing chunked transfer encoding responses when the buffer ends in an unfortunate spot
https://github.com/aio-libs/aiohttp/issues/4630
4 years ago
JustAnotherArchivist
1678075a89
Log traceback on exceptions raised from an item
4 years ago
JustAnotherArchivist
4ff8b260a1
Don't close raw data tempfiles until the response gets GC'd
Closing the raw data tempfiles immediately on connection reuse caused any response reading to fail with an I/O error if another request started on the same connection in the meantime. Delaying the closing until the response object falls out of scope and gets GC'd ensures that as long as there is a reference to that object, it can be read from, at the expense of a possibly larger memory overhead.
4 years ago
JustAnotherArchivist
4d9e4d8fe8
Fix ClientResponse._read returning more than nbytes if the entire response fits into the first block fed into the parser
4 years ago
JustAnotherArchivist
2895f4bfdf
Catch TypeError in Content-Length parsing
4 years ago
JustAnotherArchivist
8358ba9131
Add support for only reading part of the response into memory
4 years ago
JustAnotherArchivist
939978beec
Handle EOF from the HTTP payload parser correctly
Note that this should never matter anyway because the response is already run through the payload parser before.
4 years ago
JustAnotherArchivist
b1a1c03f7e
Handle STOP file and high memory usage before full disk to allow stopping while the disk is above the limit
4 years ago
JustAnotherArchivist
dd44d9b174
Adjust logging levels: log individual request failures only at WARNING and cancelled tasks at ERROR level
4 years ago
JustAnotherArchivist
820384fe1e
Stop deduping small responses
For small responses, the additional headers for the revisit outweigh the payload truncation savings. The chosen limit of 100 bytes is completely arbitrary and not backed by any real-world data.
4 years ago
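A minimal sketch of a size threshold for deduplication, assuming a digest-based lookup. Only the 100-byte limit comes from the commit message; the names `MIN_DEDUPE_SIZE` and `record_type`, the use of SHA-1, and whether the boundary is inclusive are all assumptions.

```python
import hashlib

MIN_DEDUPE_SIZE = 100  # bytes; the arbitrary limit named in the commit message


def record_type(payload, seen_digests):
    """Return 'revisit' for a repeated payload that is large enough, else 'response'."""
    if len(payload) < MIN_DEDUPE_SIZE:
        # Small payloads are always written in full; the revisit headers
        # would cost more than the truncated payload saves.
        return 'response'
    digest = hashlib.sha1(payload).digest()
    if digest in seen_digests:
        return 'revisit'
    seen_digests.add(digest)
    return 'response'


seen = set()
small = b'x' * 10
big = b'x' * 1000
types = [
    record_type(small, seen),
    record_type(small, seen),
    record_type(big, seen),
    record_type(big, seen),
]
```

Here the repeated small payload is written as a full response both times, while the second copy of the large payload becomes a revisit.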
JustAnotherArchivist
91035d769c
Catch exceptions in Item.process and mark the items as errors instead of crashing
4 years ago
JustAnotherArchivist
69984765b3
Fix taskType typo silencing cancellation warnings
4 years ago
JustAnotherArchivist
461cedbbde
Avoid temporary files created by warcio due to not knowing the record payload length
4 years ago
JustAnotherArchivist
c263ad0b03
Return ClientResponse object from fetch only if the retrieval was successful
If an exception was raised and caught, the object is still present in the history.
4 years ago
JustAnotherArchivist
cb0d11284e
Write only successful retrievals (i.e. ones that don't cause an exception) to WARC
4 years ago
JustAnotherArchivist
1214409a0b
Flush big responses to a temporary file instead of trying to keep everything in-memory
4 years ago
JustAnotherArchivist
37dbcfad21
Don't write responses to WARC that triggered an exception
For example, if the connection breaks while retrieving a response but after the headers have been parsed, the response body would be incomplete.
4 years ago
JustAnotherArchivist
93df9cd18d
Get rid of the temporary extra log file and read the plain file instead
4 years ago
JustAnotherArchivist
08c3d55376
Add comment on block digest workaround (cf. f14a664b)
4 years ago
JustAnotherArchivist
413435b7fb
Work around warcio not writing the correct WARC-Profile header for revisit records on WARC/1.1
https://github.com/webrecorder/warcio/issues/94
4 years ago
JustAnotherArchivist
08d96b37c5
Support deep/multiple inheritance from Item
4 years ago
JustAnotherArchivist
9d8de13775
Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed
This also renames add_item to add_subitem for clarity.
4 years ago
JustAnotherArchivist
50b936b18c
Refactor QWARC class to keep relevant variables in instance attributes instead of local variables
4 years ago
JustAnotherArchivist
c5d8d93166
Remove stray whitespace
4 years ago
JustAnotherArchivist
8ee9b20718
Remove WARC-Target-URI header from warcinfo record
WARC 1.1 specification, section 5.14: "A ‘warcinfo’ record shall not have a WARC-Target-URI field."
4 years ago
JustAnotherArchivist
f14a664b1c
Work around warcio not writing a block digest for warcinfo records (https://github.com/webrecorder/warcio/issues/87)
The length has to be set manually because otherwise warcio will automatically remove the header again.
4 years ago
JustAnotherArchivist
7d53577522
Add parameter for disabling SSL/TLS certificate validation
4 years ago
JustAnotherArchivist
7e049423a4
The memory leak has vanished as of CPython 3.7.3
4 years ago
JustAnotherArchivist
bd14ab3901
Fix crash due to closing the log handler on reaching the max WARC size
4 years ago
JustAnotherArchivist
08117630b0
Remove warcinfo record in each data WARC and refer to the process's warcinfo record in the meta WARC instead
4 years ago
JustAnotherArchivist
26aab15605
urn:X-qwarc instead of urn:qwarc
4 years ago
JustAnotherArchivist
50d46ad51c
Use log filename in the target URI of the log resource record
4 years ago
JustAnotherArchivist
e093211496
Set content type for resource records
4 years ago