The little things give you away... A collection of various small helper stuff
#!/usr/bin/env python3
# `warc-peek.py` is a small script to help look into gzipped WARC files without decompressing the entire file.
# It searches a window in the file for gzip's magic bytes `1F 8B`, attempts decompression, compares the result to the expected beginning of a WARC record, and prints all valid offsets.
# These can then be used with e.g. `tail` and `zless` to actually look at the records.
#
# Usage: warc-peek.py WARCFILE OFFSET LENGTH
# Opens `WARCFILE`, reads `LENGTH` bytes starting at `OFFSET` (zero-based), and prints valid WARC record offsets to stdout (one integer per line).
#
# Caveats
# - This script only works with WARCs in which each record is compressed individually.
#   This is what the specification recommends and what most tools should generate by default, but there definitely exist valid compressed WARCs which can't be processed in this way.
# - When you want to use `tail -c+OFFSET WARCFILE | zless` to look at the records, keep in mind that `tail` uses one-based indices, i.e. you will have to add one to the indices returned by `warc-peek.py`.
# - `warc-peek.py` will miss valid record offsets in the last 512 bytes of the window.
#   This is because a certain length of the compressed data is necessary to be able to decompress it. `warc-peek.py` uses 512 bytes for this and will therefore
#   not attempt decompression when `1F 8B` is found in the last 512 bytes of the window. You can increase `LENGTH` to compensate for this if necessary.

import argparse
import io
import logging
import zlib


logger = logging.getLogger('warc-peek')


def finditer(b, sub):
    # Yield the position of every (possibly overlapping) occurrence of `sub` in `b`.
    pos = 0
    while True:
        pos = b.find(sub, pos)
        if pos < 0:
            break
        yield pos
        pos += 1


def find_offsets(warcfile, offset, length):
    with open(warcfile, 'rb') as fp:
        if offset >= 0:
            fp.seek(offset)
        else:
            # Negative offset: go back from EOF and fix offset for correct output
            fp.seek(0, io.SEEK_END)
            size = fp.tell()
            fp.seek(offset, io.SEEK_END)
            offset = size + offset

        buffer = fp.read(length)

    logger.debug('Buffer length: {:d}'.format(len(buffer)))
    for pos in finditer(buffer, b'\x1f\x8b'):
        logger.debug('Trying relative offset {:d}'.format(pos))
        if pos > len(buffer) - 512: # 512 bytes might be a bit too much, but at least it ensures that the decompression will work.
            break
        try:
            # MAX_WBITS | 32 enables automatic detection of the gzip header
            dec = zlib.decompressobj(zlib.MAX_WBITS | 32).decompress(buffer[pos:pos+512])
        except zlib.error:
            # Not a valid gzip stream at this position; try the next candidate
            continue
        logger.debug('First 100 bytes of decompressed data: {!r}'.format(dec[:100]))
        if dec.startswith(b'WARC/1.0\r\n') or dec.startswith(b'WARC/1.1\r\n'):
            yield offset + pos


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--debug', action = 'store_true', help = 'Enable debug output')
    parser.add_argument('warcfile', help = 'A .warc.gz file')
    parser.add_argument('offset', type = int, help = 'Zero-based byte offset of the window')
    parser.add_argument('length', type = int, help = 'Length in bytes of the window')
    args = parser.parse_args()

    if args.debug:
        logging.basicConfig(
            format = '{asctime} {levelname} {name} {message}',
            style = '{',
            level = logging.DEBUG,
        )

    for offset in find_offsets(args.warcfile, args.offset, args.length):
        print(offset)
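
The offsets printed by the script are zero-based, so they can be passed directly to `seek()` without the plus-one adjustment that `tail -c` needs. As a minimal sketch of that alternative to `tail` and `zless`, the helper below reads the start of one record in Python. The name `peek_record` and its `nbytes` parameter are hypothetical and not part of `warc-peek.py`; the sketch also assumes each record is its own gzip member, as described in the caveats.

```python
import zlib


def peek_record(warcfile, offset, nbytes=4096):
    """Return the decompressed start of the gzip member at a zero-based offset.

    Hypothetical helper for illustration; not part of warc-peek.py.
    """
    with open(warcfile, 'rb') as fp:
        fp.seek(offset)
        chunk = fp.read(nbytes)
    # MAX_WBITS | 32 enables automatic zlib/gzip header detection.
    # The decompression object stops at the end of the first gzip member;
    # bytes belonging to later records end up in its `unused_data` attribute.
    return zlib.decompressobj(zlib.MAX_WBITS | 32).decompress(chunk)
```

If the record is longer than `nbytes` of compressed data, the return value is only the beginning of the record, which is usually enough to inspect the WARC headers.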