It is not always possible to extract text from a PDF, especially when the /ToUnicode map is missing, as mkl pointed out.
If you cannot cut and paste the correct text from Acrobat, you have very little chance of extracting it yourself; if Acrobat cannot extract it, it is very unlikely that any other tool can extract the text correctly.
If you manually create an encoding table, you could use it to remap the extracted characters to their correct values, but that mapping will most likely only work for this one document.
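For example, here is a minimal sketch in Python, assuming you have already worked out (by eye, against the rendered page) which extracted code points correspond to which real characters; the table entries below are invented purely for illustration:

```python
# Hand-built, per-document mapping from the bogus characters the extractor
# returns to the characters actually shown on the page (values are made up).
CUSTOM_MAP = str.maketrans({
    "\u0041": "T",   # extractor yields "A" where the page shows "T"
    "\u0042": "h",
    "\u0043": "e",
})

def remap(extracted_text: str) -> str:
    """Apply the hand-built table to text pulled out by your extractor."""
    return extracted_text.translate(CUSTOM_MAP)

print(remap("ABC"))  # -> "The"
```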
Often this is done on purpose. I have seen documents that randomly remap characters differently for each font in the document. It is used as a form of obfuscation, and the only real way to extract text from these PDFs is to resort to OCR. Many financial reports use this kind of trick to stop people from extracting their data.
Also, Identity-H is just a 1:1 mapping of 2-byte character codes to CIDs for all values from 0x0000 to 0xFFFF, i.e. Identity really is an identity mapping.
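As a rough illustration of what that means for the raw string data (the byte values below are made up):

```python
# With Identity-H the string data is consumed two bytes at a time
# (big-endian), and each 16-bit code IS the CID, unchanged.
def identity_h_cids(raw: bytes) -> list[int]:
    """Split a PDF string into 2-byte codes; with Identity-H, code == CID."""
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

print(identity_h_cids(b"\x00\x41\x00\x42"))  # -> [65, 66], i.e. CIDs 0x0041 and 0x0042
```

Knowing the CIDs still tells you nothing about the Unicode text, which is exactly what the missing /ToUnicode table was supposed to supply.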
Your real problem is the missing /ToUnicode entry in this PDF. I suspect there is also an embedded CMap in the PDF, which would explain why there can be 3 bytes per character.
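As a rough sketch of how a CMap's codespace ranges can produce mixed-width codes (the ranges and bytes below are invented, and the matching glosses over PDF's exact partial-match rules):

```python
# (low, high, byte_width) triples standing in for a CMap's
# "begincodespacerange ... endcodespacerange" section.
CODESPACE = [
    (0x00, 0x7F, 1),          # 1-byte codes
    (0x8140, 0x9FFF, 2),      # 2-byte codes
    (0xA00000, 0xAFFFFF, 3),  # 3-byte codes
]

def split_codes(raw: bytes) -> list[int]:
    """Walk the string, matching each position against the codespace ranges."""
    codes, i = [], 0
    while i < len(raw):
        for low, high, width in CODESPACE:
            chunk = raw[i:i + width]
            candidate = int.from_bytes(chunk, "big")
            if len(chunk) == width and low <= candidate <= high:
                codes.append(candidate)
                i += width
                break
        else:
            raise ValueError(f"byte 0x{raw[i]:02x} at offset {i} matches no codespace range")
    return codes

print([hex(c) for c in split_codes(b"\x41\x81\x42\xa0\x00\x10")])
# -> ['0x41', '0x8142', '0xa00010']  (1-, 2- and 3-byte codes in one string)
```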