The little things give you away... A collection of various small helper stuff
#!/usr/bin/env python3
# `warc-peek.py` is a small script for looking into gzipped WARC files without decompressing the entire file.
# It searches a window in the file for gzip's magic bytes `1F 8B`, attempts decompression, compares the result to the expected beginning of a WARC record, and prints all valid offsets.
# These can then be used with e.g. `tail` and `zless` to actually look at the records.
#
# Usage: warc-peek.py WARCFILE OFFSET LENGTH
# Opens `WARCFILE`, reads `LENGTH` bytes starting at `OFFSET` (zero-based), and prints valid WARC record offsets to stdout (one integer per line).
#
# Caveats
# - This script only works with WARCs in which each record is compressed individually.
#   This is what the specification recommends and what most tools should generate by default, but there definitely exist valid compressed WARCs which can't be processed in this way.
# - When you want to use `tail -c+OFFSET WARCFILE | zless` to look at the records, keep in mind that `tail` uses one-based indices, i.e. you will have to add one to the indices returned by `warc-peek.py`.
# - `warc-peek.py` will miss valid record offsets in the last 512 bytes of the window.
#   This is because a certain length of the compressed data is necessary to be able to decompress it. `warc-peek.py` uses 512 bytes for this and will therefore
#   not attempt decompression when `1F 8B` is found in the last 512 bytes of the window. You can increase `LENGTH` to compensate for this if necessary.

import argparse
import io
import logging
import zlib

logger = logging.getLogger('warc-peek')

def finditer(b, sub):
    # Yield the position of every (possibly overlapping) occurrence of `sub` in `b`.
    pos = 0
    while True:
        pos = b.find(sub, pos)
        if pos < 0:
            break
        yield pos
        pos += 1

def find_offsets(warcfile, offset, length):
    with open(warcfile, 'rb') as fp:
        if offset >= 0:
            fp.seek(offset)
        else:
            # Negative offset: go back from EOF and fix offset for correct output
            fp.seek(0, io.SEEK_END)
            size = fp.tell()
            fp.seek(offset, io.SEEK_END)
            offset = size + offset
        buffer = fp.read(length)

    logger.debug('Buffer length: {:d}'.format(len(buffer)))
    for pos in finditer(buffer, b'\x1f\x8b'):
        logger.debug('Trying relative offset {:d}'.format(pos))
        if pos > len(buffer) - 512: # 512 bytes might be a bit too much, but at least it ensures that the decompression will work.
            break
        try:
            # MAX_WBITS | 32 makes zlib auto-detect the gzip (or zlib) header.
            dec = zlib.decompressobj(zlib.MAX_WBITS | 32).decompress(buffer[pos:pos+512])
        except zlib.error:
            # The magic bytes occurred by chance inside other data; not a gzip member start.
            continue
        logger.debug('First 100 bytes of decompressed data: {!r}'.format(dec[:100]))
        if dec.startswith(b'WARC/1.0\r\n') or dec.startswith(b'WARC/1.1\r\n'):
            yield offset + pos

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--debug', action = 'store_true', help = 'Enable debug output')
    parser.add_argument('warcfile', help = 'A .warc.gz file')
    parser.add_argument('offset', type = int, help = 'Zero-based byte offset of the window')
    parser.add_argument('length', type = int, help = 'Length in bytes of the window')
    args = parser.parse_args()

    if args.debug:
        logging.basicConfig(
            format = '{asctime} {levelname} {name} {message}',
            style = '{',
            level = logging.DEBUG,
        )

    for offset in find_offsets(args.warcfile, args.offset, args.length):
        print(offset)
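
The header comments suggest piping `tail -c+OFFSET` into `zless` to view a record, which requires adding one to each printed offset. As an alternative, the following is a minimal sketch of reading a single record directly in Python from a zero-based offset as printed by `warc-peek.py`. It is not part of the script: `read_record_at` and its `chunk_size` parameter are hypothetical names chosen for illustration, and the sketch assumes each record is its own gzip member, as in the first caveat.

```python
import zlib


def read_record_at(warcfile, offset, chunk_size=65536):
    """Decompress the one gzip member that starts at zero-based `offset`.

    Hypothetical helper for illustration; assumes per-record compression.
    """
    # MAX_WBITS | 32 lets zlib auto-detect the gzip header, as warc-peek.py does.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)
    parts = []
    with open(warcfile, 'rb') as fp:
        fp.seek(offset)
        # `eof` becomes True once the end of this gzip member has been reached,
        # so the following records in the file are not decompressed.
        while not decompressor.eof:
            chunk = fp.read(chunk_size)
            if not chunk:
                break  # File ended before the member was complete (truncated input)
            parts.append(decompressor.decompress(chunk))
    return b''.join(parts)
```

Because `seek` is zero-based, the offsets printed by `warc-peek.py` can be passed in unchanged, e.g. `read_record_at('example.warc.gz', 1234)` with a placeholder file name and offset.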