java - Parsing a chemical formula

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:34:50+0000

I have developed a couple of series of articles on how to parse molecular formulas, including more complex formulas like C6H2(NO2)3CH3 .

The most recent is my presentation "PLY and PyParsing" at PyCon 2010, where I compare those two Python parsing systems using a molecular formula evaluator as my sample problem. There's even a video of my presentation.

The presentation was based on a three-part series of articles I did developing a molecular formula parser using ANTLR. In part 3 I compare the ANTLR solution to a hand-written regular expression parser and solutions in PLY and PyParsing.

The regexp and PLY solutions were first developed in a two-part series on two ways of writing parsers in Python.

The regexp solution and the base ANTLR/PLY/PyParsing solutions use a regular expression like [A-Z][a-z]?\d* to match terms in the formula. This is what @David M suggested.

Here it is worked out in Python:

import re

# element_name is: capital letter followed by optional lower-case
# count is: empty string (so the count is 1), or a set of digits
# (raw string so that \d reaches the regex engine as "any digit")
element_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

all_elements = []
for (element_name, count) in element_pat.findall("CH3COOH"):
    if count == "":
        count = 1
    else:
        count = int(count)
    all_elements.extend([element_name] * count)

print(all_elements)

When I run this (it's hard-coded to use acetic acid, CH3COOH) I get

['C', 'H', 'H', 'H', 'C', 'O', 'O', 'H']

Do note that this short bit of code assumes the molecular formula is correct. If you give it something like "##$%^O2#$$#" then it will skip the characters it doesn't recognize and give ['O', 'O']. If you don't want that, you'll have to make it a bit more robust.
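One possible way to make it more robust is to check the whole string against the pattern before extracting terms. The sketch below assumes Python 3.4+ (for re.fullmatch); the names formula_pat and parse_flat_formula are made up for this illustration. It only checks the shape of the input, not whether each symbol is a real element.

```python
import re

# One term: an element symbol followed by an optional count.
term_pat = re.compile(r"([A-Z][a-z]?)(\d*)")
# Hypothetical helper pattern: the whole string must be one or more terms.
formula_pat = re.compile(r"(?:[A-Z][a-z]?\d*)+")

def parse_flat_formula(formula):
    """Return the expanded list of element symbols, or raise ValueError."""
    if not formula_pat.fullmatch(formula):
        raise ValueError("not a valid flat formula: %r" % formula)
    elements = []
    for name, count in term_pat.findall(formula):
        # An empty count means the element appears once.
        elements.extend([name] * (int(count) if count else 1))
    return elements

print(parse_flat_formula("CH3COOH"))
```

With this check in place, a string like "##$%^O2#$$#" raises ValueError instead of quietly yielding two oxygens.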

If you want to support more complicated formulas, like C6H2(NO2)3CH3, then you'll need to know a bit about tree data structures, specifically (as @Roman points out), abstract syntax trees (most often called ASTs). That's too complicated to get into here, so see my talk and essays for more details.
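For a rough idea of what handling parentheses involves, here is a minimal sketch that avoids building a full AST and instead keeps a stack of atom counts, one per open group. This is a simpler technique swapped in for illustration, not the approach from the talk or essays, and the names token_pat and count_atoms are hypothetical. It assumes Python 3.

```python
import re
from collections import Counter

# Tokens: an element symbol, "(", ")", or a run of digits.
token_pat = re.compile(r"([A-Z][a-z]?)|(\()|(\))|(\d+)")

def count_atoms(formula):
    """Count atoms in a formula that may contain nested parentheses."""
    stack = [Counter()]  # one Counter per open group; stack[0] is the whole formula
    pending = None       # last element or closed group, waiting for an optional count

    def flush(multiplier=1):
        # Add the pending item, times its count, to the innermost open group.
        nonlocal pending
        if pending is not None:
            for name, n in pending.items():
                stack[-1][name] += n * multiplier
            pending = None

    pos = 0
    while pos < len(formula):
        m = token_pat.match(formula, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        pos = m.end()
        symbol, open_paren, close_paren, digits = m.groups()
        if digits:
            if pending is None:
                raise ValueError("count with nothing to multiply")
            flush(int(digits))
        else:
            flush()  # the previous item had no explicit count, so it counts once
            if symbol:
                pending = Counter({symbol: 1})
            elif open_paren:
                stack.append(Counter())
            else:  # close_paren
                if len(stack) == 1:
                    raise ValueError("unbalanced ')'")
                pending = stack.pop()
    flush()
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]

print(dict(count_atoms("C6H2(NO2)3CH3")))
```

Unlike the list-based example earlier, this returns per-element totals rather than the atoms in order. If you need to preserve structure (for printing or transforming the formula), a real tree is the better fit.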
