parsing - Use Scala parser combinator to parse CSV files

Question

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to write a CSV parser using Scala parser combinators. The grammar is based on RFC4180. I came up with the following code. It almost works, but I cannot get it to correctly separate different records. What did I miss?

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r
  
  def file: Parser[List[List[String]]] = ((record~((CRLF~>record)*))<~(CRLF?)) ^^ { 
    case r~rs => r::rs
  }
  def record: Parser[List[String]] = (field~((COMMA~>field)*)) ^^ {
    case f~fs => f::fs
  }
  def field: Parser[String] = escaped|nonescaped
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}


println(CSV.parse(""" "foo", "bar", 123""" + "\r\n" +
  "hello, world, 456" + "\r\n" +
  """ spam, 789, egg"""))

// Output: List(List(foo, bar, 123hello, world, 456spam, 789, egg)) 
// Expected: List(List(foo, bar, 123), List(hello, world, 456), List(spam, 789, egg))

Update: problem solved

By default, RegexParsers skips whitespace (spaces, tabs, carriage returns, and line feeds, matched by the regular expression \s+) before every token. That is why the parser above cannot separate records: the line breaks are skipped as whitespace, so CRLF never gets a chance to match them. The fix is to disable skipWhitespace. Redefining whiteSpace as just [ \t] does not solve the problem, because the parser then drops every space inside a field (so "foo bar" in the CSV becomes "foobar"), which is undesirable. The updated parser is:

import scala.util.parsing.combinator._

// A CSV parser based on RFC4180
// https://www.rfc-editor.org/rfc/rfc4180

object CSV extends RegexParsers {
  override val skipWhitespace = false   // meaningful spaces in CSV

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }  // combine 2 dquotes into 1
  def CRLF    = "\r\n" | "\n"
  def TXT     = "[^\",\r\n]".r
  def SPACES  = "[ \t]+".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ (CRLF?)

  def record: Parser[List[String]] = repsep(field, COMMA)

  def field: Parser[String] = escaped|nonescaped


  def escaped: Parser[String] = {
    ((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ { 
      case ls => ls.mkString("")
    }
  }

  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }



  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case e => throw new Exception(e.toString)
  }
}
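
As a quick check of the whitespace explanation above, here is a minimal sketch that uses only the standard library regex class, not the parser itself. It compares the default \s+ pattern with a space-or-tab pattern against input that starts with a line break. The names WhitespaceCheck and skipsPrefix are hypothetical and exist only for this illustration.

```scala
import scala.util.matching.Regex

// Hypothetical helper for illustration; not part of the CSV parser.
object WhitespaceCheck {
  // The default whiteSpace pattern used by RegexParsers.
  val defaultWhiteSpace: Regex = """\s+""".r
  // A narrower pattern: a single space or tab.
  val spaceOrTab: Regex = """[ \t]""".r

  // True when the pattern matches at the very start of the input,
  // which is the situation in which whitespace is skipped ahead of a token.
  def skipsPrefix(pattern: Regex, input: String): Boolean =
    pattern.findPrefixOf(input).isDefined

  def main(args: Array[String]): Unit = {
    // The default pattern swallows the record separator.
    println(skipsPrefix(defaultWhiteSpace, "\r\nhello"))
    // The narrower pattern leaves the line break in place...
    println(skipsPrefix(spaceOrTab, "\r\nhello"))
    // ...but still skips a space inside a field.
    println(skipsPrefix(spaceOrTab, " bar"))
  }
}
```

The first check shows why records run together under the default setting, and the third shows why narrowing the pattern still loses spaces inside fields, which is the reason the update turns skipWhitespace off entirely.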

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:13:22+0000

What you missed is whitespace. I threw in a couple of bonus improvements.

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override protected val whiteSpace = """[ \t]""".r

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = (escaped|nonescaped)
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}
