Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python 3.x - python3 Unicode to chinese

I have a JSON file and need to convert its content field to Traditional Chinese. The text may also contain emoji and similar characters. I would like to do this with Python 3. An example of the JSON file follows:

"messages": [
    {
      "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
      "timestamp_ms": 1610288228221,
      "content": "\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2",
      "type": "Generic",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
      "timestamp_ms": 1610288227699,
      "share": {
        "link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
        "share_text": "//\nMemorabilia\u00f0\u009f\u0087\u00b0\u00f0\u009f\u0087\u00b7\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a6\n\u00f0\u009f\u0098\u0086\u00f0\u009f\u00a4\u00a3\u00f0\u009f\u00a4\u00ac\u00f0\u009f\u0098\u008c\u00f0\u009f\u0098\u00b4\u00f0\u009f\u00a4\u00a9\u00f0\u009f\u00a4\u0093\n#191214\n#191221",
        "original_content_owner": "_ki.zeng"
      },
      "type": "Share",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
      "timestamp_ms": 1607742844729,
      "content": "\u00e6\u0089\u00ae\u00e7\u009e\u0093\u00e5\u00b0\u00b1\u00e5\u00a5\u00bd",
      "type": "Generic",
      "is_unsent": false
    }]
Question from: https://stackoverflow.com/questions/65838505/python3-unicode-to-chinese


1 Answer


The data as posted isn't valid JSON (it is at least missing a set of outer curly braces), and it was encoded incorrectly: each UTF-8 byte was written out as a separate Unicode code point escape. Ideally, correct the code that produced the file, but the following will fix the mess you have now, assuming "input.json" is the original data with the outer curly braces added:

import json

# Read the raw bytes of the data file
with open('input.json','rb') as f:
    raw = f.read()

# There are some newline escapes that shouldn't be converted,
# so double-escape them so the result leaves them escaped.
raw = raw.replace(rb'\n',rb'\\n')

# Convert all the escape codes to Unicode characters
raw = raw.decode('unicode_escape')

# The characters are really UTF-8 byte values.
# The "latin1" codec translates Unicode code points 1:1 to byte values,
# resulting in a byte string again.
raw = raw.encode('latin1')

# Decode correctly as UTF-8
raw = raw.decode('utf8')

# Now that the JSON is fixed, load it into a Python object
data = json.loads(raw)

# Re-write the JSON correctly.
with open('output.json','w',encoding='utf8') as f:
    json.dump(data,f,ensure_ascii=False,indent=2)

Result:

{
  "messages": [
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1610288228221,
      "content": "我隔離",
      "type": "Generic",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1610288227699,
      "share": {
        "link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
        "share_text": "//\nMemorabilia🇰🇷👩‍👩‍👧‍👧👨‍👨‍👦\n😆🤣🤬😌😴🤩🤓\n#191214\n#191221",
        "original_content_owner": "_ki.zeng"
      },
      "type": "Share",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1607742844729,
      "content": "扮瞓就好",
      "type": "Generic",
      "is_unsent": false
    }
  ]
}
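
As a possible alternative to repairing the raw bytes, the same latin1-to-UTF-8 round trip can be applied after parsing, one string at a time. The sketch below is not part of the original answer: fix_mojibake is a hypothetical helper name, the inline sample is a shortened stand-in for the real file, and it assumes every damaged string was produced the same way (UTF-8 bytes stored as code points 0-255). Strings that do not fit that pattern are left unchanged.

```python
import json

def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were stored as code points."""
    if isinstance(value, str):
        try:
            # latin1 maps code points 0-255 back to the same byte values,
            # and those bytes are then decoded as the UTF-8 they really are.
            return value.encode('latin1').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not this kind of damage: leave the string as it is.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    # Numbers, booleans and None pass through untouched.
    return value

# Shortened, hypothetical sample; the raw string keeps the escapes literal
# so that json.loads sees them exactly as they appear in the file.
sample = r'''{"messages": [{
    "content": "\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2",
    "share_text": "//\n#191214",
    "is_unsent": false}]}'''

data = fix_mojibake(json.loads(sample))
print(json.dumps(data, ensure_ascii=False, indent=2))
```

Because json.loads handles the escapes itself, this variant does not need the newline double-escaping step, at the cost of walking the whole parsed structure.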
