Python - making a filter that doesn't choke
Revision as of 10:05, 9 January 2018
This is a note to self that is probably also a useful PSA. Making a traditional UNIX filter command in Python that **doesn't** choke on any data is nontrivial.
#!/usr/bin/env python3
# How to make a line-based filter in py3 without failing. Note, do not
# attempt in py2.
import sys
# TAKE 1: fails on read: UnicodeDecodeError: 'utf-8' codec can't decode
# byte 0xe5 in position 1: invalid continuation byte
# Reason: assumes input is valid utf-8
"""
for line in sys.stdin:
    # line already ends with its newline, so don't let print add another
    print(line, end='')
"""
import io
# TAKE 2: fails on print: UnicodeEncodeError: 'utf-8' codec can't encode
# character '\udce5' in position 1: surrogates not allowed
# Works with errors='replace' but is then lossy.
# Reason: will now accept any input since the py3 surrogate hack can represent
# all bytes, but fails to print it out again since this internal representation
# has no implicit conversion to utf-8
"""
stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape')
for line in stdi:
    # line already ends with its newline, so don't let print add another
    print(line, end='')
"""
# TAKE 3: works? except cr/lf
# Reason: data now comes through unchanged via surrogate hack, but newline
# conversions are done so it's still lossy
"""
stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape')
for line in stdi:
    sys.stdout.buffer.write(line.encode('utf-8', errors='surrogateescape'))
"""
# TAKE 4: pure binary, works but you don't pretend it's text and lines
# Reason: this is nice and clean and obviously what you want if you have no
# need to reason about it as text or lines
"""
sys.stdout.buffer.write(sys.stdin.buffer.read())
"""
# TAKE 5: works
# Reason: instructed to ignore newline conversions, we now seem to get the
# same output as input, while pretending it's text and lines. whew! it's not
# clear to me why this is not implicit default for sys.stdin, as it is for
# sys.argv and os.environ
stdi = io.TextIOWrapper(sys.stdin.buffer,
                        encoding='utf-8', errors='surrogateescape', newline='')
for line in stdi:
    sys.stdout.buffer.write(line.encode('utf-8', errors='surrogateescape'))
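
To check the "same output as input" claim without piping files around, TAKE 5 can be wrapped in a function and fed in-memory byte streams. This is only a sketch: the names `filter_lines` and `roundtrip` and the `transform` parameter are made up for illustration and are not part of the script above.

```python
import io


def filter_lines(src, dst, transform=lambda line: line):
    """Copy binary stream src to binary stream dst, one text line at a time.

    Same settings as TAKE 5: surrogateescape keeps undecodable bytes,
    newline='' keeps the original line endings.
    transform is a hypothetical per-line hook (str -> str).
    """
    text_in = io.TextIOWrapper(src, encoding='utf-8',
                               errors='surrogateescape', newline='')
    for line in text_in:
        dst.write(transform(line).encode('utf-8', errors='surrogateescape'))
    text_in.detach()  # hand src back to the caller instead of closing it


def roundtrip(data):
    """Push bytes through the identity filter and return what comes out."""
    out = io.BytesIO()
    filter_lines(io.BytesIO(data), out)
    return out.getvalue()


# Invalid utf-8 (0xe5, 0xff) plus mixed \r\n, \r and \n line endings.
nasty = b'a\xe5b\r\nc\rd\n\xff'
assert roundtrip(nasty) == nasty
```

As a real filter this would presumably be called as `filter_lines(sys.stdin.buffer, sys.stdout.buffer)`, with `transform` doing whatever per-line work the filter is for.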