Python - making a filter that doesn't choke

From WTFwiki
Revision as of 18:24, 9 January 2018 by Stian (talk | contribs) (Undo revision 1258 by Stian (talk))

This is a note to self that is probably also a useful PSA: making a traditional UNIX filter command in Python that **doesn't** choke on arbitrary input (invalid UTF-8, mixed line endings) is nontrivial.
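
As a rough illustration of the kind of input that causes trouble, here is a small sketch using a made-up byte string (`raw` is a hypothetical sample, not taken from any real file). It shows that a strict UTF-8 decode rejects the stray byte 0xe5, that the `surrogateescape` error handler smuggles it through as a lone surrogate, and that the same handler is needed again on the way out.

```python
# Hypothetical sample input: 0xe5 would be a Latin-1 character, but here it
# is not part of a valid UTF-8 sequence.  The line ends in CR LF.
raw = b"a\xe5b\r\n"

# Strict decoding refuses the invalid byte.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate code point
# (0xe5 becomes U+DCE5) instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Strict encoding refuses lone surrogates...
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# ...but encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")
```

So the handler has to be applied on both decode and encode for the bytes to survive the trip.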

#!/usr/bin/env python3

# How to make a line-based filter in py3 without failing.  Note, do not
# attempt in py2.

import sys

# TAKE 1: fails on read: UnicodeDecodeError: 'utf-8' codec can't decode
# byte 0xe5 in position 1: invalid continuation byte
#
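# Reason: sys.stdin decodes with the locale encoding (utf-8 here) and, by
# default, the strict error handler, so it assumes the input is valid utf-8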
"""
for line in sys.stdin:
    print(line, end='')  # line already carries its own newline
"""

import io

# TAKE 2: fails on print: UnicodeEncodeError: 'utf-8' codec can't encode
# character '\udce5' in position 1: surrogates not allowed
#
# Works with errors='replace' instead, but is then lossy: the invalid bytes
# are replaced rather than preserved.
#
# Reason: will now accept any input, since the py3 surrogateescape hack can
# represent every byte as a lone surrogate, but fails to print it out again
# because sys.stdout still encodes strictly and lone surrogates have no
# implicit conversion to utf-8
"""
stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape')
for line in stdi:
    print(line, end='')  # line already carries its own newline
"""

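# TAKE 3: almost works, but mangles CR and CRLF line endings
#
# Reason: bytes now come through unchanged via surrogateescape on both
# sides, but TextIOWrapper's default newline=None translates '\r\n' and '\r'
# to '\n' on read, so it's still lossy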
"""
stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape')
for line in stdi:
    sys.stdout.buffer.write(line.encode('utf-8', errors='surrogateescape'))
"""

# TAKE 4: pure binary; works, but you no longer get text or lines
#
# Reason: this is nice and clean and obviously what you want if you have no
# need to reason about the data as text or lines.  Note that read() loads
# the whole input into memory.
"""
sys.stdout.buffer.write(sys.stdin.buffer.read())
"""

# TAKE 5: works
#
# Reason: newline='' turns off newline translation (lines are still split on
# '\n', '\r' and '\r\n', but the endings are passed through untouched), so we
# now seem to get the same output as input while still treating it as text
# and lines.  whew!  it's not clear to me why surrogateescape is not the
# implicit default for sys.stdin, as it is for sys.argv and os.environ
stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape', newline='')
for line in stdi:
    sys.stdout.buffer.write(line.encode('utf-8', errors='surrogateescape'))
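
The same approach can be wrapped in a function so it can be checked against in-memory streams instead of real pipes. This is only a sketch: `filter_lines` and its `transform` parameter are names invented here for illustration, and the round-trip check at the bottom uses a made-up sample.

```python
import io


def filter_lines(src, dst, transform=lambda line: line):
    """Copy lines from binary stream src to binary stream dst.

    Each line is decoded as utf-8 with surrogateescape and newline='' (no
    newline translation), passed through transform (hypothetical hook, a
    str -> str callable), and encoded back with the same error handler.
    """
    reader = io.TextIOWrapper(src, encoding="utf-8",
                              errors="surrogateescape", newline="")
    for line in reader:
        dst.write(transform(line).encode("utf-8", errors="surrogateescape"))
    # Detach so that discarding the wrapper does not close the caller's stream.
    reader.detach()


# Round-trip check on a made-up sample with an invalid byte, mixed line
# endings and no trailing newline.
sample = b"a\xe5b\r\nplain\nlast\rno newline"
out = io.BytesIO()
filter_lines(io.BytesIO(sample), out)
roundtrip_ok = out.getvalue() == sample
```

As a real filter this would be called as `filter_lines(sys.stdin.buffer, sys.stdout.buffer)`, optionally with a `transform` such as `str.upper`; the escaped bytes pass through any transform that leaves the lone surrogates alone.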