Python - making a filter that doesn't choke

From WTFwiki
Revision as of 10:05, 9 January 2018 by Stian

This is a note to self that is probably also a useful PSA. Making a traditional UNIX filter command in Python that **doesn't** choke on arbitrary input data is nontrivial.

#!/usr/bin/env python3
# How to make a line-based filter in py3 without failing. Note, do not
# attempt in py2.

import sys

# TAKE 1: fails on read: UnicodeDecodeError: 'utf-8' codec can't decode
# byte 0xe5 in position 1: invalid continuation byte
# Reason: assumes input is valid utf-8

"""
for line in sys.stdin:
    print(line, end='')  # line already carries its own newline
"""

import io

# TAKE 2: fails on print: UnicodeEncodeError: 'utf-8' codec can't encode
# character '\udce5' in position 1: surrogates not allowed
# Works with errors='replace' but is then lossy.
# Reason: will now accept any input since the py3 surrogate hack can represent
# all bytes, but fails to print it out again since this internal representation
# has no implicit conversion to utf-8

"""
stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape')
for line in stdi:
    print(line, end='')  # line already carries its own newline
"""

# TAKE 3: works? except cr/lf
# Reason: data now comes through unchanged via surrogate hack, but the default
# newline handling translates \r\n and \r to \n on read, so it's still lossy

"""
stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape')
for line in stdi:
    sys.stdout.buffer.write(line.encode('utf-8', errors='surrogateescape'))
"""

# TAKE 4: pure binary, works but you don't pretend it's text and lines
# Reason: this is nice and clean and obviously what you want if you have no
# need to reason about it as text or lines

"""
sys.stdout.buffer.write(sys.stdin.buffer.read())
"""

# TAKE 5: works
# Reason: instructed to ignore newline conversions, we now seem to get the
# same output as input, while pretending it's text and lines. whew! it's not
# clear to me why this is not implicit default for sys.stdin, as it is for
# sys.argv and os.environ

stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape', newline='')
for line in stdi:
    sys.stdout.buffer.write(line.encode('utf-8', errors='surrogateescape'))
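
A quick way to check the "same output as input" claim of TAKE 5 without piping files through the script is to run the same wrapper over in-memory streams. The sketch below is only an illustration: `filter_lines` and its `transform` hook are hypothetical names that are not part of the script above, and `io.BytesIO` stands in for `sys.stdin.buffer` and `sys.stdout.buffer`.

```python
import io


def filter_lines(src, dst, transform=lambda line: line):
    """Copy binary stream src to binary stream dst, one text line at a time.

    transform is a hypothetical per-line hook (str in, str out); the
    default passes each line through unchanged.
    """
    # newline='' still splits on \n, \r and \r\n, but hands the line
    # endings back untranslated, so they survive the round trip.
    text_in = io.TextIOWrapper(src, encoding='utf-8',
                               errors='surrogateescape', newline='')
    for line in text_in:
        dst.write(transform(line).encode('utf-8', errors='surrogateescape'))
    # Detach so the wrapper does not close src when it is discarded.
    text_in.detach()


# Invalid UTF-8 (a lone 0xe5), mixed line endings, no trailing newline.
data = b'h\xe5llo\r\nworld\nlast\rend'
out = io.BytesIO()
filter_lines(io.BytesIO(data), out)
assert out.getvalue() == data
```

In a real filter the call would presumably be `filter_lines(sys.stdin.buffer, sys.stdout.buffer, some_function)`, with the caveat that any `transform` has to tolerate lone surrogate characters in the lines it receives.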