Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to change the bytes in a file?

I'm making an encryption program and I need to open a file in binary mode to access non-ASCII and non-printable characters. I need to check whether each character from the file is a letter, a number, a symbol, or an unprintable character. That means I have to check, one by one, whether the bytes (when they are decoded to ASCII) match any of these characters:

{^9,dzEV=Q4ciT+/s};fnq3BFh% #2!k7>YSU<GyDI]|OC_e.W0M~ua-jR5lv1wA`@8t*xr'K"[P)&b:g$p(mX6Ho?JNZL

I think I could encode these characters above to binary and then compare them with bytes. I don't know how to do this.

P.S. Sorry for bad English and binary misunderstanding. (I hope you know what I mean by bytes, I mean characters in binary mode like this):

\x01\x00\x9a\x9c\x18\x00


1 Answer


There are two major string types in Python: bytestrings (a sequence of bytes) that represent binary data, and Unicode strings (a sequence of Unicode code points) that represent human-readable text. It is simple to convert one into the other:

unicode_text = bytestring.decode(character_encoding)
bytestring = unicode_text.encode(character_encoding)
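
As a small illustration of that round trip, here is a sketch that assumes UTF-8 as the character encoding (the question does not fix an encoding, so that choice and the sample strings are assumptions). It also shows what happens when arbitrary binary data is decoded as ASCII:

```python
# Round trip between text and bytes, assuming UTF-8 as the encoding.
unicode_text = 'na\u00efve \u262f'           # sample text with non-ASCII characters
bytestring = unicode_text.encode('utf-8')
roundtrip = bytestring.decode('utf-8')
print(roundtrip == unicode_text)

# Arbitrary binary data is usually NOT valid text in a given encoding.
binary = b'\x01\x00\x9a\x9c\x18\x00'
try:
    binary.decode('ascii')                   # bytes >= 0x80 are not ASCII
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' substitutes U+FFFD for each undecodable byte instead of raising.
lossy = binary.decode('ascii', errors='replace')
```

Because decoding can fail or lose information like this, it is usually safer to classify the raw bytes directly rather than decode them first.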

If you open a file in binary mode, e.g., 'rb', then file.read() returns a bytestring (bytes type). A bytes literal can be written with printable ASCII characters or with \x escapes:

>>> b'A' == b'\x41' == chr(0b1000001).encode()
True
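
To make the binary-mode read concrete, here is a minimal sketch that writes a few sample bytes to a throwaway temporary file (the file and its contents are hypothetical) and reads them back. One detail worth knowing in Python 3: indexing or iterating a bytes object gives integers, while slicing gives bytes.

```python
import os
import tempfile

# Write a few sample bytes to a throwaway file (hypothetical data).
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'A\x01\x9a')
    path = tmp.name

with open(path, 'rb') as file:
    data = file.read()        # a bytes object
os.remove(path)

first = data[0]               # indexing gives an int in Python 3
first_slice = data[0:1]       # slicing gives a bytes object of length 1
values = list(data)           # iterating yields ints in range(256)
```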

There are several methods that can be used to classify bytes:

  • string methods such as bytes.isdigit():

    >>> b'1'.isdigit()
    True
    
  • string constants such as string.printable

    >>> import string
    >>> b'!' in string.printable.encode()
    True
    
  • regular expressions such as \d

    >>> import re
    >>> bool(re.match(br'\d+$', b'123'))
    True
    
  • classification functions in curses.ascii module e.g., curses.ascii.isprint()

    >>> from curses import ascii
    >>> bytearray(filter(ascii.isprint, b'123'))
    bytearray(b'123')
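
Putting the string constants to work on the original question, here is a sketch that sorts each byte into letter, digit, symbol, or unprintable. The category names, the `classify` helper, and the decision to count the space character as a symbol are all assumptions made for illustration:

```python
import string
from collections import Counter

LETTERS = string.ascii_letters.encode()
DIGITS = string.digits.encode()
# Punctuation plus the space character; treating space as a symbol is an assumption.
SYMBOLS = (string.punctuation + ' ').encode()

def classify(byte):
    """Return a category name for one byte value (an int in range(256))."""
    if byte in LETTERS:
        return 'letter'
    if byte in DIGITS:
        return 'digit'
    if byte in SYMBOLS:
        return 'symbol'
    return 'unprintable'

# Sample data; in practice this would come from open(filename, 'rb').read().
data = b'A1!\x01\x00\x9a z'
counts = Counter(classify(b) for b in data)
```

Iterating over `data` yields integers, and `int in bytes` is a membership test on byte values, so no decoding is needed.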
    

bytearray is a mutable sequence of bytes. Unlike a bytestring, you can change it in place, e.g., to lowercase every 3rd byte that is uppercase:

>>> import string
>>> a = bytearray(b'ABCDEF_')
>>> uppercase = string.ascii_uppercase.encode()
>>> a[::3] = [b | 0b0100000 if b in uppercase else b 
...           for b in a[::3]]
>>> a
bytearray(b'aBCdEF_')

Notice: b'ad' are lowercase but b'_' remained the same.
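
For contrast, a short sketch of the difference between the two types (the variable names are just for illustration). It also shows why OR-ing with 0b0100000 lowercases an ASCII uppercase letter: upper and lower case differ only in bit 5 (0x20).

```python
immutable = b'ABC'
try:
    immutable[0] = 0x61            # bytes does not support item assignment
except TypeError as exc:
    error = exc

mutable = bytearray(immutable)     # a mutable copy of the same bytes
mutable[0] |= 0b0100000            # set bit 5: 'A' (0x41) becomes 'a' (0x61)
result = bytes(mutable)            # back to an immutable bytestring
```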


To modify a binary file in place, you could use the mmap module, e.g., to lowercase the 4th column in every other line in 'file':

#!/usr/bin/env python3
import mmap
import string

uppercase = string.ascii_uppercase.encode()
ncolumn = 3 # select 4th column
with open('file', 'r+b') as file, \
     mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    while True:
        mm.readline()   # ignore every other line
        pos = mm.tell() # remember current position
        if not mm.readline(): # EOF
            break
        if mm[pos + ncolumn] in uppercase:
            mm[pos + ncolumn] |= 0b0100000 # lowercase

Note: Python 2 and 3 APIs differ in this case. The code uses Python 3.

Input

ABCDE1
FGHIJ
ABCDE
FGHI

Output

ABCDE1
FGHiJ
ABCDE
FGHi

Notice: the 4th column became lowercase on the 2nd and 4th lines.
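
The mmap approach assumes every selected line is long enough to have a 4th column. Below is a hedged variant wrapped in a hypothetical helper, `lowercase_column`, which skips lines that are too short and runs against a throwaway temporary file so it can be tried safely. The helper name, the guard, and the sample data are additions for illustration, not part of the code above:

```python
import mmap
import os
import string
import tempfile

UPPERCASE = string.ascii_uppercase.encode()

def lowercase_column(path, ncolumn):
    """Lowercase byte `ncolumn` (0-based) of every other line, in place."""
    with open(path, 'r+b') as file, \
         mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        while True:
            mm.readline()              # skip a line
            pos = mm.tell()            # start of the line to modify
            line = mm.readline()
            if not line:               # EOF
                break
            # Guard: only touch the column if it lies inside this line.
            if ncolumn < len(line.rstrip(b'\n')) and mm[pos + ncolumn] in UPPERCASE:
                mm[pos + ncolumn] |= 0b0100000  # lowercase

# Try it on a temporary file; the last line is too short and is left alone.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'ABCDE1\nFGHIJ\nABCDE\nFG\n')
    path = tmp.name
lowercase_column(path, 3)
with open(path, 'rb') as file:
    result = file.read()
os.remove(path)
```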


Typically, if you want to change a file, you read from the file, write the modifications to a temporary file, and on success move the temporary file in place of the original file:

#!/usr/bin/env python3
import os
import string
from tempfile import NamedTemporaryFile

caesar_shift = 3
filename = 'file'

def caesar_bytes(plaintext, shift, alphabet=string.ascii_lowercase.encode()):
    shifted_alphabet = alphabet[shift:] + alphabet[:shift]
    return plaintext.translate(plaintext.maketrans(alphabet, shifted_alphabet))

dest_dir = os.path.dirname(filename)
chunksize = 1 << 15
with open(filename, 'rb') as file, \
     NamedTemporaryFile('wb', dir=dest_dir, delete=False) as tmp_file:
    while True: # encrypt
        chunk = file.read(chunksize)
        if not chunk: # EOF
            break
        tmp_file.write(caesar_bytes(chunk, caesar_shift))
os.replace(tmp_file.name, filename)

Input

abc
def
ABC
DEF

Output

def
ghi
ABC
DEF

To convert the output back, set caesar_shift = -3.
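
A quick in-memory check of that round trip, reusing the caesar_bytes function from the script above (the sample data is made up, and `bytes.maketrans` is called on the type rather than on the instance, which is equivalent):

```python
import string

def caesar_bytes(plaintext, shift, alphabet=string.ascii_lowercase.encode()):
    # Rotate the alphabet by `shift` and map each byte through the rotation.
    shifted_alphabet = alphabet[shift:] + alphabet[:shift]
    return plaintext.translate(bytes.maketrans(alphabet, shifted_alphabet))

original = b'abc\ndef\nABC\nDEF\nxyz\n'
encrypted = caesar_bytes(original, 3)     # lowercase letters shift, wrapping at 'z'
decrypted = caesar_bytes(encrypted, -3)   # the negative shift undoes it
```

Bytes outside the alphabet (uppercase letters, newlines, any binary data) pass through unchanged, which is why the uppercase lines in the output are untouched.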

