JustAnotherArchivist
820384fe1e
Stop deduping small responses
For small responses, the additional headers for the revisit outweigh the payload truncation savings. The chosen limit of 100 bytes is completely arbitrary and not backed by any real-world data.
4 years ago
JustAnotherArchivist
91035d769c
Catch exceptions in Item.process and mark the items as errors instead of crashing
4 years ago
JustAnotherArchivist
69984765b3
Fix taskType typo silencing cancellation warnings
4 years ago
JustAnotherArchivist
461cedbbde
Avoid temporary files created by warcio due to not knowing the record payload length
4 years ago
JustAnotherArchivist
c263ad0b03
Return ClientResponse object from fetch only if the retrieval was successful
If an exception was raised and caught, the object is still present in the history.
4 years ago
JustAnotherArchivist
cb0d11284e
Write only successful retrievals (i.e. ones that don't cause an exception) to WARC
4 years ago
JustAnotherArchivist
1214409a0b
Flush big responses to a temporary file instead of trying to keep everything in-memory
4 years ago
JustAnotherArchivist
37dbcfad21
Don't write responses to WARC that triggered an exception
For example, if the connection breaks while retrieving a response but after the headers have been parsed, the response body would be incomplete.
4 years ago
JustAnotherArchivist
93df9cd18d
Get rid of the temporary extra log file and read the plain file instead
4 years ago
JustAnotherArchivist
08c3d55376
Add comment on block digest workaround (cf. f14a664b)
4 years ago
JustAnotherArchivist
413435b7fb
Work around warcio not writing the correct WARC-Profile header for revisit records on WARC/1.1
https://github.com/webrecorder/warcio/issues/94
4 years ago
JustAnotherArchivist
08d96b37c5
Support deep/multiple inheritance from Item
4 years ago
JustAnotherArchivist
9d8de13775
Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed
This also renames add_item to add_subitem for clarity.
4 years ago
JustAnotherArchivist
50b936b18c
Refactor QWARC class to keep relevant variables in instance attributes instead of local variables
4 years ago
JustAnotherArchivist
c5d8d93166
Remove stray whitespace
4 years ago
JustAnotherArchivist
8ee9b20718
Remove WARC-Target-URI header from warcinfo record
WARC 1.1 specification, section 5.14: "A ‘warcinfo’ record shall not have a WARC-Target-URI field."
4 years ago
JustAnotherArchivist
f14a664b1c
Work around warcio not writing a block digest for warcinfo records ( https://github.com/webrecorder/warcio/issues/87 )
The length has to be set manually because otherwise warcio will automatically remove the header again.
4 years ago
JustAnotherArchivist
7d53577522
Add parameter for disabling SSL/TLS certificate validation
4 years ago
JustAnotherArchivist
7e049423a4
The memory leak has vanished as of CPython 3.7.3
4 years ago
JustAnotherArchivist
bd14ab3901
Fix crash due to closing the log handler on reaching the max WARC size
4 years ago
JustAnotherArchivist
08117630b0
Remove warcinfo record in each data WARC and refer to the process's warcinfo record in the meta WARC instead
4 years ago
JustAnotherArchivist
26aab15605
urn:X-qwarc instead of urn:qwarc
4 years ago
JustAnotherArchivist
50d46ad51c
Use log filename in the target URI of the log resource record
4 years ago
JustAnotherArchivist
e093211496
Set content type for resource records
4 years ago
JustAnotherArchivist
ae46b53401
Always write a WARC-Warcinfo-ID header
4 years ago
JustAnotherArchivist
23fcdd4026
Write microsecond dates for request and response records
4 years ago
JustAnotherArchivist
3030ad10ab
Mark private API accordingly
4 years ago
JustAnotherArchivist
e0b4104d21
Remove log handler before writing log record since that requires closing the stream
4 years ago
JustAnotherArchivist
6cfd352f68
Write WARC/1.1 files
4 years ago
JustAnotherArchivist
e1ad5c232e
Write warcinfo and resource records in meta WARC on firing up qwarc rather than at the end
4 years ago
JustAnotherArchivist
f038cf91db
Fix unfound distribution handling
4 years ago
JustAnotherArchivist
a5dfd5c805
Write spec file + its dependencies and command line to meta WARC
4 years ago
JustAnotherArchivist
e99e2304c9
Write meta WARC with log file
4 years ago
JustAnotherArchivist
d751844626
Fix starting another item before stopping on STOP file or memory limit exceedance
4 years ago
JustAnotherArchivist
2b0778f9b5
Remove leftovers from initial code rewrite
4 years ago
JustAnotherArchivist
85d78cee13
Add warcinfo record with version information on Python, system, and dependencies
4 years ago
JustAnotherArchivist
9eaa7be4c8
Python 3.7 compatibility
4 years ago
JustAnotherArchivist
9cff6bd5c1
Only open a WARC file when necessary to avoid producing empty WARCs at the end
4 years ago
JustAnotherArchivist
21cf784102
Use setuptools_scm for versioning
4 years ago
JustAnotherArchivist
ab22966fef
Add to log which item a message is coming from
5 years ago
JustAnotherArchivist
6fafd32685
Error when the retries are exceeded
5 years ago
JustAnotherArchivist
8647d6b396
Use f-strings instead of str.format
5 years ago
JustAnotherArchivist
5008e6e8cd
Deduplicate items
5 years ago
JustAnotherArchivist
46c95e2157
Disable decoding the response content
chardet can be very slow (https://github.com/chardet/chardet/issues/29 https://github.com/psf/requests/issues/2359 ) and the decoding may be unnecessary if it's binary content.
5 years ago
JustAnotherArchivist
91cd20f567
Version 0.1.3
5 years ago
JustAnotherArchivist
85f6f7bd82
Make qwarc.utils.handle_response_limit_error_retries more useful by passing the deferring handler as an argument
5 years ago
JustAnotherArchivist
ad22a2327a
Support adding headers to individual requests
5 years ago
JustAnotherArchivist
67076f964c
Add support for POST requests
5 years ago
JustAnotherArchivist
57764eb2b0
Version 0.1.2
5 years ago
JustAnotherArchivist
2d52e78d85
Fix reference to aiohttp.CientError
5 years ago