Commit Graph

  Graph layout, reading the column markers in the same order as the commit list below: 5579129 through a7d7852 form the 0.2 line; f5c3eb4 (0.2-wip-rm-warcio) is a one-commit side branch; 2e1dc59 (HEAD -> master) through d3437bd form the master line. All three fork from 17fc349 (tag: v0.2.5), and the history from there down to 57764eb is linear.

  • 5579129 (tag: v0.2.8, 0.2) Support overriding the total fetch timeout by JustAnotherArchivist 2021-11-20 23:44:35 +0000
  • 215ac03 (tag: v0.2.7) Support HEAD requests by JustAnotherArchivist 2021-11-19 04:04:38 +0000
  • 8f46225 (tag: v0.2.6) Replace warcio with own WARC writing implementation by JustAnotherArchivist 2021-05-28 21:40:57 +0000
  • a7d7852 Fix ISO-8859-1-encoded Location header handling by JustAnotherArchivist 2021-05-28 19:23:25 +0000
  • f5c3eb4 (0.2-wip-rm-warcio) WIP attempt to remove warcio by JustAnotherArchivist 2021-05-28 04:49:59 +0000
  • 2e1dc59 (HEAD -> master) Fix log level of one message by JustAnotherArchivist 2020-09-26 01:59:31 +0000
  • f025c4e Add extensive debug logging by JustAnotherArchivist 2020-07-16 03:36:28 +0000
  • ce7f8fd Make optional arguments to fetch kwarg-only by JustAnotherArchivist 2020-07-16 03:34:34 +0000
  • b29db24 Configurable verbosity for log file and stderr by JustAnotherArchivist 2020-07-16 03:01:05 +0000
  • dbe1ed7 "Freeze" log file object before writing to WARC to ensure that further log messages aren't picked up by JustAnotherArchivist 2020-07-16 02:35:52 +0000
  • 8ca2a6b Fix exceptions on journal errors by JustAnotherArchivist 2020-07-16 02:34:07 +0000
  • 3c8b45b Refactor cleanup code by JustAnotherArchivist 2020-07-14 05:53:35 +0000
  • dcd5455 Fix crash on starting a run while the DB is locked by JustAnotherArchivist 2020-07-14 05:45:03 +0000
  • 168fa78 Avoid locking the DB when there are no subitems to insert by JustAnotherArchivist 2020-07-14 05:43:05 +0000
  • 4484d6c Add Item representation by JustAnotherArchivist 2020-07-14 05:42:40 +0000
  • 5675118 Rename id to id_ to avoid clash with builtin by JustAnotherArchivist 2020-07-14 05:42:06 +0000
  • a1e6937 Replace DB locking with an async context manager by JustAnotherArchivist 2020-07-14 03:05:23 +0000
  • cbcef2f Add Linux classifier by JustAnotherArchivist 2020-07-14 01:16:34 +0000
  • 733506a Remove obsolete TODO by JustAnotherArchivist 2020-07-14 01:14:17 +0000
  • c7fac0e Add WARC journalling with rollback on errors by JustAnotherArchivist 2020-07-13 05:22:29 +0000
  • a4cf1a4 Fix str_get_all_between yielding half-overlapping matches by JustAnotherArchivist 2020-07-12 19:34:55 +0000
  • 15203bd Handle redirect traps/loops by JustAnotherArchivist 2020-07-12 00:06:17 +0000
  • f8f5258 Track redirect depth by JustAnotherArchivist 2020-07-11 23:48:25 +0000
  • a3d6fb3 Turn response handlers into kwarg-only functions for easier extendability without breaking existing code by JustAnotherArchivist 2020-07-11 23:11:25 +0000
  • a91cc23 Simplify get_software_info's signature to just the extra dependency packages by JustAnotherArchivist 2020-07-11 22:05:30 +0000
  • 6cc4adb Remove stray TODO by JustAnotherArchivist 2020-07-11 21:38:07 +0000
  • c5604ef Simplify header merging by JustAnotherArchivist 2020-07-11 21:27:48 +0000
  • 59ae118 Add fromResponse parameter for URL completion and automatic Referer header by JustAnotherArchivist 2020-07-11 21:26:28 +0000
  • 2324216 Add baseUrl and evaluate incomplete URLs relative to it by JustAnotherArchivist 2020-07-11 21:11:54 +0000
  • b30ccf8 Move response/exception history to ClientResponse.qhistory by JustAnotherArchivist 2020-07-11 20:53:07 +0000
  • e69527c Add defaultResponseHandler on the Item level by JustAnotherArchivist 2020-07-11 20:08:54 +0000
  • 03336e4 Add item to response handler arguments (e.g. for logging) by JustAnotherArchivist 2020-07-11 19:52:48 +0000
  • 005999f Disable aiohttp's Content-Type checking on JSON parsing by default by JustAnotherArchivist 2020-07-11 04:17:34 +0000
  • 6bdcfe7 Refactor database creation and item generation: call `Item.generate()` on every qwarc run and dedupe its output, allowing the addition of further items by modifying the spec file by JustAnotherArchivist 2020-07-11 03:46:24 +0000
  • c878241 Switch from concurrent.futures.CancelledError to asyncio.CancelledError by JustAnotherArchivist 2020-07-11 02:59:16 +0000
  • 749158b Use the Future's result directly rather than awaiting again by JustAnotherArchivist 2020-07-11 02:44:48 +0000
  • 5c6169e Bump Python version classifiers by JustAnotherArchivist 2020-07-11 01:12:50 +0000
  • a85e80f Configurable request timeout by JustAnotherArchivist 2020-07-11 01:11:38 +0000
  • 429ac94 Make it possible to override and remove headers by JustAnotherArchivist 2020-07-11 01:11:15 +0000
  • e40be54 Document verify_ssl parameter by JustAnotherArchivist 2020-07-11 01:10:05 +0000
  • d3437bd Move default headers to qwarc.const by JustAnotherArchivist 2020-07-11 01:08:43 +0000
  • 17fc349 (tag: v0.2.5) Fix infinite loop in workaround for aiohttp issue 4630 by JustAnotherArchivist 2020-07-02 14:15:56 +0000
  • b6003af (tag: v0.2.4) Work around aiohttp bug on parsing chunked transfer encoding responses when the buffer ends in an unfortunate spot by JustAnotherArchivist 2020-03-16 02:53:02 +0000
  • 1678075 (tag: v0.2.3) Log traceback on exceptions raised from an item by JustAnotherArchivist 2020-01-06 01:43:40 +0000
  • 4ff8b26 Don't close raw data tempfiles until the response gets GC'd by JustAnotherArchivist 2020-01-06 01:38:40 +0000
  • 4d9e4d8 (tag: v0.2.2) Fix ClientResponse._read returning more than nbytes if the entire response fits into the first block fed into the parser by JustAnotherArchivist 2019-12-11 02:07:26 +0000
  • 2895f4b Catch TypeError in Content-Length parsing by JustAnotherArchivist 2019-12-11 02:06:53 +0000
  • 8358ba9 Add support for only reading part of the response into memory by JustAnotherArchivist 2019-12-11 01:36:07 +0000
  • 939978b Handle EOF from the HTTP payload parser correctly by JustAnotherArchivist 2019-12-11 01:27:56 +0000
  • b1a1c03 Handle STOP file and high memory usage before full disk to allow stopping while the disk is above the limit by JustAnotherArchivist 2019-12-11 01:18:10 +0000
  • dd44d9b Adjust logging levels: log individual request failures only at WARNING and cancelled tasks at ERROR level by JustAnotherArchivist 2019-12-11 01:11:06 +0000
  • 820384f Stop deduping small responses by JustAnotherArchivist 2019-12-11 00:53:45 +0000
  • 91035d7 Catch exceptions in Item.process and mark the items as errors instead of crashing by JustAnotherArchivist 2019-12-11 00:46:56 +0000
  • 6998476 Fix taskType typo silencing cancellation warnings by JustAnotherArchivist 2019-12-11 00:44:58 +0000
  • 461cedb Avoid temporary files created by warcio due to not knowing the record payload length by JustAnotherArchivist 2019-12-11 00:26:24 +0000
  • c263ad0 Return ClientResponse object from fetch only if the retrieval was successful by JustAnotherArchivist 2019-12-10 22:20:17 +0000
  • cb0d112 Write only successful retrievals (i.e. ones that don't cause an exception) to WARC by JustAnotherArchivist 2019-12-10 22:08:32 +0000
  • 1214409 Flush big responses to a temporary file instead of trying to keep everything in-memory by JustAnotherArchivist 2019-12-10 22:04:47 +0000
  • 37dbcfa Don't write responses to WARC that triggered an exception by JustAnotherArchivist 2019-12-09 23:56:45 +0000
  • 93df9cd (tag: v0.2.1) Get rid of the temporary extra log file and read the plain file instead by JustAnotherArchivist 2019-09-08 21:05:31 +0000
  • 08c3d55 Add comment on block digest workaround (cf. f14a664b) by JustAnotherArchivist 2019-09-08 20:47:47 +0000
  • 413435b Work around warcio not writing the correct WARC-Profile header for revisit records on WARC/1.1 by JustAnotherArchivist 2019-09-08 20:45:46 +0000
  • 08d96b3 Support deep/multiple inheritance from Item by JustAnotherArchivist 2019-09-06 15:06:58 +0000
  • 9d8de13 Add Item.flush_subitems to flush the new subitems to the database while the item is still being processed by JustAnotherArchivist 2019-09-06 14:56:11 +0000
  • 50b936b Refactor QWARC class to keep relevant variables in instance attributes instead of local variables by JustAnotherArchivist 2019-09-06 14:01:32 +0000
  • c5d8d93 Remove stray whitespace by JustAnotherArchivist 2019-09-06 13:32:45 +0000
  • 8ee9b20 (tag: v0.2.0) Remove WARC-Target-URI header from warcinfo record by JustAnotherArchivist 2019-08-26 13:35:46 +0000
  • f14a664 Work around warcio not writing a block digest for warcinfo records (https://github.com/webrecorder/warcio/issues/87) by JustAnotherArchivist 2019-08-26 13:21:18 +0000
  • 7d53577 Add parameter for disabling SSL/TLS certificate validation by JustAnotherArchivist 2019-08-26 13:05:03 +0000
  • 7e04942 The memory leak has vanished as of CPython 3.7.3 by JustAnotherArchivist 2019-08-26 12:50:02 +0000
  • bd14ab3 Fix crash due to closing the log handler on reaching the max WARC size by JustAnotherArchivist 2019-08-26 12:19:00 +0000
  • 0811763 Remove warcinfo record in each data WARC and refer to the process's warcinfo record in the meta WARC instead by JustAnotherArchivist 2019-08-26 12:13:27 +0000
  • 26aab15 urn:X-qwarc instead of urn:qwarc by JustAnotherArchivist 2019-08-26 12:10:25 +0000
  • 50d46ad Use log filename in the target URI of the log resource record by JustAnotherArchivist 2019-08-26 12:09:33 +0000
  • e093211 Set content type for resource records by JustAnotherArchivist 2019-08-16 14:59:47 +0000
  • ae46b53 Always write a WARC-Warcinfo-ID header by JustAnotherArchivist 2019-08-16 14:46:27 +0000
  • 23fcdd4 Write microsecond dates for request and response records by JustAnotherArchivist 2019-08-01 01:07:30 +0000
  • 3030ad1 Mark private API accordingly by JustAnotherArchivist 2019-08-01 00:49:17 +0000
  • e0b4104 Remove log handler before writing log record since that requires closing the stream by JustAnotherArchivist 2019-08-01 00:48:31 +0000
  • 6cfd352 Write WARC/1.1 files by JustAnotherArchivist 2019-08-01 00:41:05 +0000
  • e1ad5c2 Write warcinfo and resource records in meta WARC on firing up qwarc rather than at the end by JustAnotherArchivist 2019-08-01 00:39:06 +0000
  • f038cf9 Fix unfound distribution handling by JustAnotherArchivist 2019-08-01 00:26:17 +0000
  • a5dfd5c Write spec file + its dependencies and command line to meta WARC by JustAnotherArchivist 2019-08-01 00:24:42 +0000
  • e99e230 Write meta WARC with log file by JustAnotherArchivist 2019-07-26 17:31:29 +0000
  • d751844 Fix starting another item before stopping on STOP file or memory limit exceedance by JustAnotherArchivist 2019-07-26 15:59:39 +0000
  • 2b0778f Remove leftovers from initial code rewrite by JustAnotherArchivist 2019-07-26 15:54:46 +0000
  • 85d78ce Add warcinfo record with version information on Python, system, and dependencies by JustAnotherArchivist 2019-07-26 15:44:32 +0000
  • 9eaa7be Python 3.7 compatibility by JustAnotherArchivist 2019-07-26 14:31:50 +0000
  • 9cff6bd Only open a WARC file when necessary to avoid producing empty WARCs at the end by JustAnotherArchivist 2019-07-26 14:31:33 +0000
  • 21cf784 Use setuptools_scm for versioning by JustAnotherArchivist 2019-07-26 13:53:01 +0000
  • ab22966 Add to log which item a message is coming from by JustAnotherArchivist 2019-07-24 16:01:29 +0000
  • 6fafd32 Error when the retries are exceeded by JustAnotherArchivist 2019-07-24 15:39:32 +0000
  • 8647d6b Use f-strings instead of str.format by JustAnotherArchivist 2019-07-24 15:23:22 +0000
  • 5008e6e Deduplicate items by JustAnotherArchivist 2019-07-24 15:07:26 +0000
  • 46c95e2 Disable decoding the response content by JustAnotherArchivist 2019-07-24 13:17:25 +0000
  • 91cd20f (tag: v0.1.3) Version 0.1.3 by JustAnotherArchivist 2019-04-29 17:53:53 +0000
  • 85f6f7b Make qwarc.utils.handle_response_limit_error_retries more useful by passing the deferring handler as an argument by JustAnotherArchivist 2019-04-29 17:52:54 +0000
  • ad22a23 Support adding headers to individual requests by JustAnotherArchivist 2019-04-29 17:51:50 +0000
  • 67076f9 Add support for POST requests by JustAnotherArchivist 2019-04-29 17:51:15 +0000
  • 57764eb (tag: v0.1.2) Version 0.1.2 by JustAnotherArchivist 2019-04-24 23:02:56 +0000