Difference between revisions of "Python - making a filter that doesn't choke"
Revision as of 10:06, 9 January 2018
This is a note to self that is probably also a useful PSA. Making a traditional UNIX filter command in Python that **doesn't** choke on arbitrary data is nontrivial.
<pre>
#!/usr/bin/env python3

# How to make a line-based filter in py3 without failing. Note, do not
# attempt in py2.

import sys

# TAKE 1: fails on read: UnicodeDecodeError: 'utf-8' codec can't decode
# byte 0xe5 in position 1: invalid continuation byte
#
# Reason: assumes input is valid utf-8
"""
for line in sys.stdin:
    print(line)
"""

import io

# TAKE 2: fails on print: UnicodeEncodeError: 'utf-8' codec can't encode
# character '\udce5' in position 1: surrogates not allowed
#
# Works with errors='replace' but is then lossy.
#
# Reason: will now accept any input since the py3 surrogate hack can represent
# all bytes, but fails to print it out again since this internal representation
# has no implicit conversion to utf-8
"""
stdi = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8',
                        errors='surrogateescape')
for line in stdi:
    print(line)
"""

# TAKE 3: works? except cr/lf
#
# Reason: data now comes through unchanged via surrogate hack, but newline
# conversions are done so it's still lossy
"""
stdi = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8',
                        errors='surrogateescape')
for line in stdi:
    sys.stdout.buffer.write(line.encode('utf-8', errors='surrogateescape'))
"""

# TAKE 4: pure binary, works but you don't pretend it's text and lines
#
# Reason: this is nice and clean and obviously what you want if you have no
# need to reason about it as text or lines
"""
sys.stdout.buffer.write(sys.stdin.buffer.read())
"""

# TAKE 5: works
#
# Reason: instructed to ignore newline conversions, we now seem to get the
# same output as input, while pretending it's text and lines. whew! it's not
# clear to me why this is not implicit default for sys.stdin, as it is for
# sys.argv and os.environ
stdi = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8',
                        errors='surrogateescape', newline='')
for line in stdi:
    sys.stdout.buffer.write(line.encode('utf-8', errors='surrogateescape'))
</pre>
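The difference between TAKE 3 and TAKE 5 can be checked without a terminal by running the same loop over in-memory byte streams. The following is a minimal sketch, not part of the original note: the helper name `roundtrip` and the sample bytes are made up for illustration, and `io.BytesIO` stands in for `sys.stdin.buffer` and `sys.stdout.buffer`.

```python
import io


def roundtrip(data, newline=""):
    """Push bytes through a text-mode line loop and return the bytes written.

    Hypothetical helper: io.BytesIO replaces sys.stdin.buffer/sys.stdout.buffer.
    newline="" mirrors TAKE 5; newline=None mirrors TAKE 3 (the default).
    """
    src = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8",
                           errors="surrogateescape", newline=newline)
    out = io.BytesIO()
    for line in src:
        # Undecodable bytes were mapped to lone surrogates on read;
        # encoding with the same error handler maps them back.
        out.write(line.encode("utf-8", errors="surrogateescape"))
    return out.getvalue()


# Invalid UTF-8 (0xe5) plus all three line-ending styles.
sample = b"caf\xe5\r\nplain\nold mac\rend"

print(roundtrip(sample) == sample)                # True: TAKE 5 is lossless
print(roundtrip(sample, newline=None) == sample)  # False: \r\n and \r became \n
```

With `newline=''` the wrapper still splits lines on `\n`, `\r` and `\r\n` but hands the terminators back untranslated, which is why the loop can stay line-oriented and still reproduce its input byte for byte.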