I am currently working in a Jupyter Notebook using Spark (PySpark) for an NLP project.
It is very slow and it is not using the full RAM available.
In the terminal, if I type free -g, I get the following:
[free -g output: roughly 125 GB of total RAM]
However, the modeling is very slow.
This is my setup:
# Import the findspark module
import findspark
# Initialize via the full spark path
findspark.init("/usr/local/spark/")
# Import SparkSession plus the SQL functions and types from pyspark.sql
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Gets an existing :class:`SparkSession` or, if there is no existing one, creates a
# new one based on the options set in this builder.
spark = SparkSession.builder \
    .master("local") \
    .appName("AmazonReviewPredictionTest_data") \
    .config("spark.executor.memory", "5gb") \
    .getOrCreate()
# Main entry point for Spark functionality. A SparkContext represents the
# connection to a Spark cluster, and can be used to create :class:`RDD` and
# broadcast variables on that cluster.
sc = spark.sparkContext
sqlContext = SQLContext(sc)
I have also increased spark.executor.memory from 1gb to 5gb via config("spark.executor.memory", "5gb").
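To confirm the setting actually took effect, I can print the configuration the session is running with; this is just a sanity check using the standard getConf() API:

# Sanity check: list every property of the running session to verify
# that spark.executor.memory really came through as 5gb.
for key, value in sc.getConf().getAll():
    print(key, "=", value)
# Or query the one property directly:
print(sc.getConf().get("spark.executor.memory"))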
Is there anything I could do to increase the speed?
The text file is 500 MB of Amazon reviews, and this already takes hours to execute.
But I also have a 12 GB text file; would that take weeks?
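For reference, the reviews are read in roughly like this (the path below is a placeholder, not my real one):

# Sketch of the loading step; /data/amazon_reviews.txt is an illustrative path.
reviews = spark.read.text("/data/amazon_reviews.txt")
print(reviews.count())  # forces a full pass over the 500 MB file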
I think I have 125 GB of RAM, as shown in the free -g output above.
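To double-check the total RAM from Python (Linux only, the same number free -g reports):

import os
# Total physical memory in GB: page size times number of physical pages.
total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / (1024 ** 3)
print(total_gb)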