JustAnotherArchivist
|
1214409a0b
|
Flush big responses to a temporary file instead of trying to keep everything in-memory
|
4 years ago |
JustAnotherArchivist
|
37dbcfad21
|
Don't write responses to WARC that triggered an exception
For example, if the connection breaks while retrieving a response but after the headers have been parsed, the response body would be incomplete.
|
4 years ago |
JustAnotherArchivist
|
93df9cd18d
|
Get rid of the temporary extra log file and read the plain file instead
|
4 years ago |
JustAnotherArchivist
|
08c3d55376
|
Add comment on block digest workaround (cf. f14a664b)
|
4 years ago |
JustAnotherArchivist
|
413435b7fb
|
Work around warcio not writing the correct WARC-Profile header for revisit records on WARC/1.1
https://github.com/webrecorder/warcio/issues/94
|
4 years ago |
JustAnotherArchivist
|
08d96b37c5
|
Support deep/multiple inheritance from Item
|
4 years ago |
JustAnotherArchivist
|
9d8de13775
|
Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed
This also renames add_item to add_subitem for clarity.
|
4 years ago |
JustAnotherArchivist
|
50b936b18c
|
Refactor QWARC class to keep relevant variables in instance attributes instead of local variables
|
4 years ago |
JustAnotherArchivist
|
c5d8d93166
|
Remove stray whitespace
|
4 years ago |
JustAnotherArchivist
|
8ee9b20718
|
Remove WARC-Target-URI header from warcinfo record
WARC 1.1 specification, section 5.14: "A ‘warcinfo’ record shall not have a WARC-Target-URI field."
|
4 years ago |
JustAnotherArchivist
|
f14a664b1c
|
Work around warcio not writing a block digest for warcinfo records (https://github.com/webrecorder/warcio/issues/87)
The length has to be set manually because otherwise warcio will automatically remove the header again.
|
4 years ago |
JustAnotherArchivist
|
7d53577522
|
Add parameter for disabling SSL/TLS certificate validation
|
4 years ago |
JustAnotherArchivist
|
7e049423a4
|
The memory leak has vanished as of CPython 3.7.3
|
4 years ago |
JustAnotherArchivist
|
bd14ab3901
|
Fix crash due to closing the log handler on reaching the max WARC size
|
4 years ago |
JustAnotherArchivist
|
08117630b0
|
Remove warcinfo record in each data WARC and refer to the process's warcinfo record in the meta WARC instead
|
4 years ago |
JustAnotherArchivist
|
26aab15605
|
urn:X-qwarc instead of urn:qwarc
|
4 years ago |
JustAnotherArchivist
|
50d46ad51c
|
Use log filename in the target URI of the log resource record
|
4 years ago |
JustAnotherArchivist
|
e093211496
|
Set content type for resource records
|
4 years ago |
JustAnotherArchivist
|
ae46b53401
|
Always write a WARC-Warcinfo-ID header
|
4 years ago |
JustAnotherArchivist
|
23fcdd4026
|
Write microsecond dates for request and response records
|
4 years ago |
JustAnotherArchivist
|
3030ad10ab
|
Mark private API accordingly
|
4 years ago |
JustAnotherArchivist
|
e0b4104d21
|
Remove log handler before writing log record since that requires closing the stream
|
4 years ago |
JustAnotherArchivist
|
6cfd352f68
|
Write WARC/1.1 files
|
4 years ago |
JustAnotherArchivist
|
e1ad5c232e
|
Write warcinfo and resource records in meta WARC on firing up qwarc rather than at the end
|
4 years ago |
JustAnotherArchivist
|
f038cf91db
|
Fix unfound distribution handling
|
4 years ago |
JustAnotherArchivist
|
a5dfd5c805
|
Write spec file + its dependencies and command line to meta WARC
|
4 years ago |
JustAnotherArchivist
|
e99e2304c9
|
Write meta WARC with log file
|
4 years ago |
JustAnotherArchivist
|
d751844626
|
Fix starting another item before stopping on STOP file or memory limit exceedance
|
4 years ago |
JustAnotherArchivist
|
2b0778f9b5
|
Remove leftovers from initial code rewrite
|
4 years ago |
JustAnotherArchivist
|
85d78cee13
|
Add warcinfo record with version information on Python, system, and dependencies
|
4 years ago |
JustAnotherArchivist
|
9eaa7be4c8
|
Python 3.7 compatibility
|
4 years ago |
JustAnotherArchivist
|
9cff6bd5c1
|
Only open a WARC file when necessary to avoid producing empty WARCs at the end
|
4 years ago |
JustAnotherArchivist
|
21cf784102
|
Use setuptools_scm for versioning
|
4 years ago |
JustAnotherArchivist
|
ab22966fef
|
Add to log which item a message is coming from
|
5 years ago |
JustAnotherArchivist
|
6fafd32685
|
Error when the retries are exceeded
|
5 years ago |
JustAnotherArchivist
|
8647d6b396
|
Use f-strings instead of str.format
|
5 years ago |
JustAnotherArchivist
|
5008e6e8cd
|
Deduplicate items
|
5 years ago |
JustAnotherArchivist
|
46c95e2157
|
Disable decoding the response content
chardet can be very slow (https://github.com/chardet/chardet/issues/29 https://github.com/psf/requests/issues/2359) and the decoding may be unnecessary if it's binary content.
|
5 years ago |
JustAnotherArchivist
|
91cd20f567
|
Version 0.1.3
|
5 years ago |
JustAnotherArchivist
|
85f6f7bd82
|
Make qwarc.utils.handle_response_limit_error_retries more useful by passing the deferring handler as an argument
|
5 years ago |
JustAnotherArchivist
|
ad22a2327a
|
Support adding headers to individual requests
|
5 years ago |
JustAnotherArchivist
|
67076f964c
|
Add support for POST requests
|
5 years ago |
JustAnotherArchivist
|
57764eb2b0
|
Version 0.1.2
|
5 years ago |
JustAnotherArchivist
|
2d52e78d85
|
Fix reference to aiohttp.CientError
|
5 years ago |
JustAnotherArchivist
|
0f107e988d
|
Version 0.1.1
|
5 years ago |
JustAnotherArchivist
|
c1574a06c9
|
Fix sleep task type
|
5 years ago |
JustAnotherArchivist
|
e0ca88c807
|
Fix reference to get_rss
|
5 years ago |
JustAnotherArchivist
|
984d28ede0
|
Fix type of --memorylimit, --disklimit, and --warcsplit values
|
5 years ago |
JustAnotherArchivist
|
8a8935810d
|
Fix references to memory and disk space check methods
|
5 years ago |
JustAnotherArchivist
|
1c8983fc1e
|
Version 0.1.0
|
5 years ago |