I am asking this question even though I already have a workaround (see the answers), in the hope of saving anyone else the same pain.
I needed a way to send the output of show() on my dataset to my log4j logger. I did this using:
void org.apache.spark.sql.Dataset.show(int numRows, boolean truncate)
which simply prints to stdout. To capture stdout I did the following (inspired by another answer on Stack Overflow):
void myMethod(Dataset<Row> data) {
    // Save the original System.out
    PrintStream originalPrintStream = System.out;
    // Redirect System.out to an in-memory buffer
    ByteArrayOutputStream logCollection = new ByteArrayOutputStream();
    PrintStream printStreamForCollectingLogs = new PrintStream(logCollection);
    System.setOut(printStreamForCollectingLogs);
    // show() writes the rendered table to (what it believes is) stdout
    data.show(MAX_DISPLAY_ROWS, false);
    // Restore the original stream
    System.out.flush();
    System.setOut(originalPrintStream);
    // Log the captured table, starting on a fresh line
    logger.info("\n" + logCollection.toString());
    logCollection.reset();
}
This works only once; subsequent calls to the same method, even for the same dataset, fail to capture anything.
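My reading of the Scala 2.11 source (treat this as an assumption, not a confirmed diagnosis): Dataset.show() prints via Scala's println, which writes to scala.Console.out rather than consulting System.out on every call. scala.Console captures its default stream once, when the Console object is first initialized, which here happens during the first show() call while System.out is still redirected; later System.setOut calls are invisible to it. A minimal sketch of a fix under that assumption is to redirect Scala's Console alongside System.out. The method name showViaLogger is just illustrative; MAX_DISPLAY_ROWS and logger are as in the snippet above:

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

void showViaLogger(Dataset<Row> data) {
    PrintStream originalOut = System.out;
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    PrintStream capture = new PrintStream(buffer);
    // Redirect both streams: System.out for Java code, scala.Console for
    // Scala's println (Console.setOut is deprecated in 2.11 but still present)
    System.setOut(capture);
    scala.Console.setOut(capture);
    try {
        data.show(MAX_DISPLAY_ROWS, false);
    } finally {
        // Restore both streams even if show() throws
        capture.flush();
        System.setOut(originalOut);
        scala.Console.setOut(originalOut);
    }
    logger.info("\n" + buffer.toString());
}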
I am using:
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.5</version>
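For Spark 2.4 specifically, there may be a way to avoid stream redirection altogether: the table that show() prints is built by an internal method, Dataset.showString(int numRows, int truncate, boolean vertical), where truncate = 0 disables truncation. It is marked private[sql], so the sketch below reaches it via reflection; this leans on a Spark implementation detail and may break in other versions. The helper name datasetAsString is hypothetical:

import java.lang.reflect.Method;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

String datasetAsString(Dataset<Row> data, int numRows) throws Exception {
    // showString(numRows, truncate, vertical): truncate = 0 keeps full cell
    // contents, vertical = false keeps the usual grid layout
    Method showString = Dataset.class.getDeclaredMethod(
            "showString", int.class, int.class, boolean.class);
    showString.setAccessible(true);
    return (String) showString.invoke(data, numRows, 0, false);
}

// usage: logger.info("\n" + datasetAsString(data, MAX_DISPLAY_ROWS));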
question from:
https://stackoverflow.com/questions/65901329/spark-dataset-show-unable-to-capture-output-multiple-times