I'm writing a parser for the first time. I'm following this tutorial on Pratt parers. I've got it to work, but I've come up with sort of a problem.
The original tutorial is written in Java. I prefer C++, so that's what I wrote mine with. I was able to basically port most of the code to C++ (although, I did make it "mine" in the sense that there are some non-language related differences). The only real problem I have is with this line of code:
public Expression parse(Parser parser, Token token) {
Expression operand = parser.parseExpression();
? return new PrefixExpression(token.getType(), operand);
This works fine in Java (I'm assuming. I've never really worked with Java before, but I assume the guy knows what he's doing), but in C++ not so much. I was able to accomplish the same thing by using pointers like so:
Expression* parse(Parser& parser, Token token) {
Expression* operand = parser.parseExpression();
return new PrefixExpression(token.getType(), operand);
Which (although I am unfamiliar with the semantics of Java) seems to do the exact same thing in C++, only with pointers instead of normal objects.
However, the problem with working with pointers like this is that it gets messy kind of fast. Now it become much easier for everything to work with pointers, which means I have to worry about deallocation, and maybe memory leaks if I don't do it right. It just becomes a mess.
Now, the solution seems easy. I could just return PrefixExpression
like this:
Expression parse(Parser& parser, Token token) {
Expression operand = parser.parseExpression();
return PrefixExpression(token.getType(), operand);
Here's my problem: if I do it like this, I lose the vtable and any extra data in this new Expression
. That's a problem since Expression
is actually just a base class for many types of expressions. Parse
can parse anything it wants to, not just a PrefixExpression
. That's how the original was designed. Generally, I like that design, but, as you can see, it's causing problems. Simply returning a new Expression
right here loses things I need from that object later.
Now, I can try to solve this by returning a reference:
Expression& parse(Parser& parser, Token token) {
// ...
return PrefixExpression(token.getType(), operand);
That solves the vtable and extra data problem, but now that creates a new one. I'm returning a reference to a variable that will be destroyed instantly, which is of no help.
All of this to say, that's why I originally ultimately went with pointers. Pointers let me keep the data I needed later, but they are really hard to work with. I can squeeze by, but personally I'd like something better.
I think I could use std::move
, but I'm not familiar with that enough to be certain I'd be using it properly. If I have to I will, but implementing that properly takes some skill and knowledge I just don't have. Besides, that is a lot of work to rework everything I have to work that way up to this point.
All of that to lead to the main point of my question: can I simply just return a reference to a new object safely? Let me just show an example:
Expression& parse(Parser& parser, Token token) {
//...
return *(new PrefixExpression(token.getType(), operand));
This would be nice and solve most of my problems because, if it does what I think it does, I get a reference to a new object, keep the vtable and extra data, and it doesn't get destroyed immediately. This would let me have my cake and eat it too.
However, my problem is can I actually do this? While I feel I have a good reason to do this, this to me seems very weird. I'm allocating new data inside a function, and expecting it to get deallocated outside the function automatically like any normal variable. Even if that did work, would that behave as I would expect it to outside this function completely? I am scared that this might be invoking undefined behavior or something like that. What does the standard think of this?
Edit: So here is a requested minimal sample:
Expression:
// A (not really pure) purely virtual base class that holds all types of expressions
class Expression {
protected:
const std::string type;
public:
Expression() : type("default") {}
virtual ~Expression() {} //Because I'm dealing with pointers, I *think* I need a virtual destructor here. Otherwise, I don't really need
virtual operator std::string() {
// Since I am working with a parser, I want some way to debug and make sure I'm parsing correctly. This was the easiest.
throw ("ERROR: No conversion to std::string implemented for this expression!");
}
// Keep in mind, I may do several other things here, depending on how I want to use Expression
};
A child Expression
, for Parenthesis:
class Paren : public Expression {
private:
// Again, Pointer is not my preferred way, but this was just easier, since Parse() was returning a pointer anyway.
Expression* value;
public:
Paren(Expression *e) {
// I know this is also sketchy. I should be trying to perform a copy here.
// However, I'm not sure how to do this, since Expression could be anything.
// I just decided to write my code so the new object takes ownership of the pointer. I could and should do better
value = e;
}
virtual operator std::string() {
return "(" + std::string(*value) + ")";
}
// Because again, I'm working with pointers
~Paren() {delete value;}
};
And a parser:
class Parser {
private:
Grammar::Grammar grammar;
public:
// this is just a function that creates a unique identifier for each token.
// Tokens normally have types identifier, number, or symbol.
// This would work, except I'd like to make grammar rules based off
// the type of symbol, not all symbols in general
std::string GetMapKey(Tokenizer::Token token) {
if(token.type == "symbol") return token.value;
return token.type;
}
// the parsing function
Expression * parseExpression(double precedence = 0) {
// the current token
Token token = consume();
// detect and throw an error here if we have no such prefix
if(!grammar.HasPrefix(GetMapKey(token))) {
throw("Error! Invalid grammar! No such prefix operator.");
}
// get a prefix parselet
Grammar::PrefixCallback preParse = grammar.GetPrefixCallback(GetMapKey(token));
// get the left side
Expression * left = preParse(token,*this);
token = peek();
double debug = peekPrecedence();
while(precedence < peekPrecedence() && grammar.HasInfix(GetMapKey(token))) {
// we peeked the token, now we should consume it, now that we know there are no errors
token = consume();
// get the infix parser
Grammar::InfixCallback inParse = grammar.GetInfixCallback(GetMapKey(token));
// and get the in-parsed token
left = inParse(token,left,*this);
}
return left;
}
After I posted the parser code, I realized I should mention that I put all the grammar related stuff into its own class. It just has some nice utilities related to grammar, as well as allows us to write a grammar independent parser and worry about the grammar later:
class Grammar {
public:
// I'm in visual studio 2010, which doesn't seem to like the using type = value; syntax, so this instead
typedef std::function<Expression*(Tokenizer::Token,Parser&)> PrefixCallback;
typedef std::function<Expression*(Tokenizer::Token, Expression*, Parser&)> InfixCallback;
private:
std::map<std::string, PrefixCallback> prefix;
std::map<std::string, InfixCallback> infix;
std::map<std::string, double> infixPrecedence; // we'll use double precedence for more flexabillaty
public:
Grammar() {
prefixBindingPower = std::numeric_limits<double>::max();
}
void RegisterPrefix(std::string key, PrefixCallback c) {
prefix[key] = c;
}
PrefixCallback GetPrefixCallback(std::string key) {
return prefix[key];
}
bool HasPrefix(std::string key) {
return prefix.find(key) != prefix.end();
}
void RegisterInfix(std::string key, InfixCallback c, double p) {
infix[key] = c;
infixPrecedence[key] = p;
}
InfixCallback GetInfixCallback(std::string key) {
return infix[key];
}
double GetInfixPrecedence(std::string key) {
return infixPrecedence[key];
}
bool HasInfix(std::string key) {
return infix.find(key) != infix.end();
}
};
Finally, I probably need to show a parsing callback to complete the set:
Expression* ParenPrefixParselet(Tokenizer::Token token, Parser& parser) {
Expression* value = parser.parseExpression(0);
Expression* parenthesis = new Paren(value); // control of value gets given to our new expression. No need to delete
parser.consume(")");
return parenthesis;
}
That allows me to write a grammar that allows for things in parenthesis like this:
Grammar g;
g.RegisterPrefix("(", &ParenPrefixParselet);
Finally, a main():
int main() {
Grammar g;
g.RegisterPrefix("(", &ParenPrefixParselet);
Parser parser(g);
Expression* e = parser.parseExpression(0);
std::cout << static_cast<std::string>(*e);
return 0;
}
Believe it or not, I think that's pretty minimal. Remember, this is a parser. Keep in mind, that as a minimal example, I plan on it being expanded, but hopefully you get the idea.
See Question&Answers more detail:
os