5579129
(tag: v0.2.8, 0.2)
Support overriding the total fetch timeout by
2021-11-20 23:44:35 +0000
215ac03
(tag: v0.2.7)
Support HEAD requests by
2021-11-19 04:04:38 +0000
8f46225
(tag: v0.2.6)
Replace warcio with own WARC writing implementation by
2021-05-28 21:40:57 +0000
a7d7852
Fix ISO-8859-1-encoded Location header handling by
2021-05-28 19:23:25 +0000
f5c3eb4
(0.2-wip-rm-warcio)
WIP attempt to remove warcio by
2021-05-28 04:49:59 +0000
2e1dc59
(HEAD -> master)
Fix log level of one message by
2020-09-26 01:59:31 +0000
f025c4e
Add extensive debug logging by
2020-07-16 03:36:28 +0000
ce7f8fd
Make optional arguments to fetch kwarg-only by
2020-07-16 03:34:34 +0000
b29db24
Configurable verbosity for log file and stderr by
2020-07-16 03:01:05 +0000
dbe1ed7
"Freeze" log file object before writing to WARC to ensure that further log messages aren't picked up by
2020-07-16 02:35:52 +0000
8ca2a6b
Fix exceptions on journal errors by
2020-07-16 02:34:07 +0000
3c8b45b
Refactor cleanup code by
2020-07-14 05:53:35 +0000
dcd5455
Fix crash on starting a run while the DB is locked by
2020-07-14 05:45:03 +0000
168fa78
Avoid locking the DB when there are no subitems to insert by
2020-07-14 05:43:05 +0000
4484d6c
Add Item representation by
2020-07-14 05:42:40 +0000
5675118
Rename id to id_ to avoid clash with builtin by
2020-07-14 05:42:06 +0000
a1e6937
Replace DB locking with an async context manager by
2020-07-14 03:05:23 +0000
cbcef2f
Add Linux classifier by
2020-07-14 01:16:34 +0000
733506a
Remove obsolete TODO by
2020-07-14 01:14:17 +0000
c7fac0e
Add WARC journalling with rollback on errors by
2020-07-13 05:22:29 +0000
a4cf1a4
Fix str_get_all_between yielding half-overlapping matches by
2020-07-12 19:34:55 +0000
15203bd
Handle redirect traps/loops by
2020-07-12 00:06:17 +0000
f8f5258
Track redirect depth by
2020-07-11 23:48:25 +0000
a3d6fb3
Turn response handlers into kwarg-only functions for easier extendability without breaking existing code by
2020-07-11 23:11:25 +0000
a91cc23
Simplify get_software_info's signature to just the extra dependency packages by
2020-07-11 22:05:30 +0000
6cc4adb
Remove stray TODO by
2020-07-11 21:38:07 +0000
c5604ef
Simplify header merging by
2020-07-11 21:27:48 +0000
59ae118
Add fromResponse parameter for URL completion and automatic Referer header by
2020-07-11 21:26:28 +0000
2324216
Add baseUrl and evaluate incomplete URLs relative to it by
2020-07-11 21:11:54 +0000
b30ccf8
Move response/exception history to ClientResponse.qhistory by
2020-07-11 20:53:07 +0000
e69527c
Add defaultResponseHandler on the Item level by
2020-07-11 20:08:54 +0000
03336e4
Add item to response handler arguments (e.g. for logging) by
2020-07-11 19:52:48 +0000
005999f
Disable aiohttp's Content-Type checking on JSON parsing by default by
2020-07-11 04:17:34 +0000
6bdcfe7
Refactor database creation and item generation: call `Item.generate()` on every qwarc run and dedupe its output, allowing the addition of further items by modifying the spec file by
2020-07-11 03:46:24 +0000
c878241
Switch from concurrent.futures.CancelledError to asyncio.CancelledError by
2020-07-11 02:59:16 +0000
749158b
Use the Future's result directly rather than awaiting again by
2020-07-11 02:44:48 +0000
5c6169e
Bump Python version classifiers by
2020-07-11 01:12:50 +0000
a85e80f
Configurable request timeout by
2020-07-11 01:11:38 +0000
429ac94
Make it possible to override and remove headers by
2020-07-11 01:11:15 +0000
e40be54
Document verify_ssl parameter by
2020-07-11 01:10:05 +0000
d3437bd
Move default headers to qwarc.const by
2020-07-11 01:08:43 +0000
17fc349
(tag: v0.2.5)
Fix infinite loop in workaround for aiohttp issue 4630 by
2020-07-02 14:15:56 +0000
b6003af
(tag: v0.2.4)
Work around aiohttp bug on parsing chunked transfer encoding responses when the buffer ends in an unfortunate spot by
2020-03-16 02:53:02 +0000
1678075
(tag: v0.2.3)
Log traceback on exceptions raised from an item by
2020-01-06 01:43:40 +0000
4ff8b26
Don't close raw data tempfiles until the response gets GC'd by
2020-01-06 01:38:40 +0000
4d9e4d8
(tag: v0.2.2)
Fix ClientResponse._read returning more than nbytes if the entire response fits into the first block fed into the parser by
2019-12-11 02:07:26 +0000
2895f4b
Catch TypeError in Content-Length parsing by
2019-12-11 02:06:53 +0000
8358ba9
Add support for only reading part of the response into memory by
2019-12-11 01:36:07 +0000
939978b
Handle EOF from the HTTP payload parser correctly by
2019-12-11 01:27:56 +0000
b1a1c03
Handle STOP file and high memory usage before full disk to allow stopping while the disk is above the limit by
2019-12-11 01:18:10 +0000
dd44d9b
Adjust logging levels: log individual request failures only at WARNING and cancelled tasks at ERROR level by
2019-12-11 01:11:06 +0000
820384f
Stop deduping small responses by
2019-12-11 00:53:45 +0000
91035d7
Catch exceptions in Item.process and mark the items as errors instead of crashing by
2019-12-11 00:46:56 +0000
6998476
Fix taskType typo silencing cancellation warnings by
2019-12-11 00:44:58 +0000
461cedb
Avoid temporary files created by warcio due to not knowing the record payload length by
2019-12-11 00:26:24 +0000
c263ad0
Return ClientResponse object from fetch only if the retrieval was successful by
2019-12-10 22:20:17 +0000
cb0d112
Write only successful retrievals (i.e. ones that don't cause an exception) to WARC by
2019-12-10 22:08:32 +0000
1214409
Flush big responses to a temporary file instead of trying to keep everything in-memory by
2019-12-10 22:04:47 +0000
37dbcfa
Don't write responses to WARC that triggered an exception by
2019-12-09 23:56:45 +0000
93df9cd
(tag: v0.2.1)
Get rid of the temporary extra log file and read the plain file instead by
2019-09-08 21:05:31 +0000
08c3d55
Add comment on block digest workaround (cf. f14a664b) by
2019-09-08 20:47:47 +0000
413435b
Work around warcio not writing the correct WARC-Profile header for revisit records on WARC/1.1 by
2019-09-08 20:45:46 +0000
08d96b3
Support deep/multiple inheritance from Item by
2019-09-06 15:06:58 +0000
9d8de13
Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed by
2019-09-06 14:56:11 +0000
50b936b
Refactor QWARC class to keep relevant variables in instance attributes instead of local variables by
2019-09-06 14:01:32 +0000
c5d8d93
Remove stray whitespace by
2019-09-06 13:32:45 +0000
8ee9b20
(tag: v0.2.0)
Remove WARC-Target-URI header from warcinfo record by
2019-08-26 13:35:46 +0000
f14a664
Work around warcio not writing a block digest for warcinfo records (https://github.com/webrecorder/warcio/issues/87) by
2019-08-26 13:21:18 +0000
7d53577
Add parameter for disabling SSL/TLS certificate validation by
2019-08-26 13:05:03 +0000
7e04942
The memory leak has vanished as of CPython 3.7.3 by
2019-08-26 12:50:02 +0000
bd14ab3
Fix crash due to closing the log handler on reaching the max WARC size by
2019-08-26 12:19:00 +0000
0811763
Remove warcinfo record in each data WARC and refer to the process's warcinfo record in the meta WARC instead by
2019-08-26 12:13:27 +0000
26aab15
urn:X-qwarc instead of urn:qwarc by
2019-08-26 12:10:25 +0000
50d46ad
Use log filename in the target URI of the log resource record by
2019-08-26 12:09:33 +0000
e093211
Set content type for resource records by
2019-08-16 14:59:47 +0000
ae46b53
Always write a WARC-Warcinfo-ID header by
2019-08-16 14:46:27 +0000
23fcdd4
Write microsecond dates for request and response records by
2019-08-01 01:07:30 +0000
3030ad1
Mark private API accordingly by
2019-08-01 00:49:17 +0000
e0b4104
Remove log handler before writing log record since that requires closing the stream by
2019-08-01 00:48:31 +0000
6cfd352
Write WARC/1.1 files by
2019-08-01 00:41:05 +0000
e1ad5c2
Write warcinfo and resource records in meta WARC on firing up qwarc rather than at the end by
2019-08-01 00:39:06 +0000
f038cf9
Fix unfound distribution handling by
2019-08-01 00:26:17 +0000
a5dfd5c
Write spec file + its dependencies and command line to meta WARC by
2019-08-01 00:24:42 +0000
e99e230
Write meta WARC with log file by
2019-07-26 17:31:29 +0000
d751844
Fix starting another item before stopping on STOP file or memory limit exceedance by
2019-07-26 15:59:39 +0000
2b0778f
Remove leftovers from initial code rewrite by
2019-07-26 15:54:46 +0000
85d78ce
Add warcinfo record with version information on Python, system, and dependencies by
2019-07-26 15:44:32 +0000
9eaa7be
Python 3.7 compatibility by
2019-07-26 14:31:50 +0000
9cff6bd
Only open a WARC file when necessary to avoid producing empty WARCs at the end by
2019-07-26 14:31:33 +0000
21cf784
Use setuptools_scm for versioning by
2019-07-26 13:53:01 +0000
ab22966
Add to log which item a message is coming from by
2019-07-24 16:01:29 +0000
6fafd32
Error when the retries are exceeded by
2019-07-24 15:39:32 +0000
8647d6b
Use f-strings instead of str.format by
2019-07-24 15:23:22 +0000
5008e6e
Deduplicate items by
2019-07-24 15:07:26 +0000
46c95e2
Disable decoding the response content by
2019-07-24 13:17:25 +0000
91cd20f
(tag: v0.1.3)
Version 0.1.3 by
2019-04-29 17:53:53 +0000
85f6f7b
Make qwarc.utils.handle_response_limit_error_retries more useful by passing the deferring handler as an argument by
2019-04-29 17:52:54 +0000
ad22a23
Support adding headers to individual requests by
2019-04-29 17:51:50 +0000
67076f9
Add support for POST requests by
2019-04-29 17:51:15 +0000
57764eb
(tag: v0.1.2)
Version 0.1.2 by
2019-04-24 23:02:56 +0000