I have a small PySpark program that uses xgboost4j and xgboost4j-spark to train a model on a dataset held in a Spark DataFrame.
Training and saving work, but I cannot load the saved model.
Current library versions:
- PySpark 2.4.5
- xgboost4j 0.91
- xgboost4j-spark 0.91
The main process is as follows:
    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.feature import MinMaxScaler, VectorAssembler

    trainingData, testData = data.randomSplit([0.7, 0.3])

    # Assemble the numeric columns into a single feature vector.
    vectorAssembler = (
        VectorAssembler()
        .setInputCols(numeric_features_new)
        .setOutputCol(FEATURES)
    )
    # Rescale the assembled features to the [0, 1] range.
    scaler = MinMaxScaler(inputCol=FEATURES,
                          outputCol=FEATURES + '_scaler')
    assemblerInputCols = FEATURES + '_scaler'

    xgb_params = dict(
        eta=0.1,
        maxDepth=2,
        missing=0.0,
        objective="binary:logistic",
        numRound=5,
        numWorkers=1
    )
    xgb = (
        XGBoostClassifier(**xgb_params)
        .setFeaturesCol(assemblerInputCols)
        .setLabelCol(LABEL)
    )
    pipeline = Pipeline(stages=[
        vectorAssembler,
        scaler,
        xgb
    ])

    print("training model")
    pipline_model = pipeline.fit(trainingData)
    print("saving model to S3")
    pipline_model.write().overwrite().save(modelOssDir)
    print("saved model to S3")
    print("Loading model...")
    pipline_model = PipelineModel.load(modelOssDir)
The error I get:
    Traceback (most recent call last):
      File "xgboost.py", line 95, in <module>
        pipline_model = PipelineModel.load(modelOssDir)
      File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/util.py", line 362, in load
      File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/pipeline.py", line 242, in load
      File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/util.py", line 304, in load
      File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/pipeline.py", line 299, in _from_java
      File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/wrapper.py", line 227, in _from_java
      File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/wrapper.py", line 221, in __get_class
    ImportError: No module named ml.dmlc.xgboost4j.scala.spark
        at com.aliyun.odps.cupid.CupidUtil.errMsg2SparkException(CupidUtil.java:50)
        at com.aliyun.odps.cupid.CupidUtil.getResult(CupidUtil.java:131)
        at com.aliyun.odps.cupid.requestcupid.YarnClientImplUtil.pollAMStatus(YarnClientImplUtil.java:108)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.applicationReportTransform(YarnClientImpl.java:377)
        ... 12 more
    21/01/22 11:39:21 ERROR Client: Application diagnostics message: Failed to contact YARN for application application_1611286494541_745555769.
    Exception in thread "main" org.apache.spark.SparkException: Application application_1611286494541_745555769 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1543)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
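One possible reading of the ImportError, offered as a sketch and not a confirmed diagnosis: when PySpark loads a saved stage, `_from_java` in `pyspark/ml/wrapper.py` derives the Python class path from the Java class name by replacing the `org.apache.spark` prefix with `pyspark`, then imports the resulting module. A class outside that prefix keeps its Java package as the module name. The snippet below imitates only that string handling in plain Python. The helper names `python_class_path` and `module_to_import` are hypothetical, and the class name `XGBoostClassificationModel` is an assumption about what the fitted XGBoost stage is called.

```python
def python_class_path(java_class_name):
    """Imitate the name translation PySpark applies to a saved stage's class."""
    # Only the "org.apache.spark" prefix is rewritten; any other
    # package name passes through unchanged.
    return java_class_name.replace("org.apache.spark", "pyspark")


def module_to_import(java_class_name):
    """Return the dotted module path that would be imported for a Java class."""
    parts = python_class_path(java_class_name).split(".")
    # Everything except the final component (the class name) is the module.
    return ".".join(parts[:-1])


# A built-in Spark stage maps onto a module that ships with PySpark.
print(module_to_import("org.apache.spark.ml.feature.MinMaxScalerModel"))
# prints: pyspark.ml.feature

# Assumed class name of the fitted XGBoost stage: the module path stays
# identical to the Java package, matching the name in the ImportError.
print(module_to_import("ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel"))
# prints: ml.dmlc.xgboost4j.scala.spark
```

If this reading is right, the failure is on the Python side: loading needs a Python module importable under that dotted path, which the Java jars alone would not provide.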
I have searched online for a long time, including the answers to "Cannot save model using PySpark xgboost4j", but nothing has worked. Any help or ideas on how to get the model to load would be appreciated.
Thanks in advance.
question from:
https://stackoverflow.com/questions/65839025/cannot-load-model-using-pyspark-xgboost4j