I am currently working in a Jupyter Notebook using Spark (PySpark) for an NLP project.
It is very slow and it is not using the full RAM available.
In the terminal, if I type free -g, I get the following:
[free -g output: roughly 125 GB of total RAM]
However, the modeling is very slow.
This is my setup:
# Import the findspark module
import findspark
# Initialize via the full spark path
findspark.init("/usr/local/spark/")
# Import SparkSession plus the SQL functions and types from pyspark.sql
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Gets an existing :class:`SparkSession` or, if there is no existing one, creates a
# new one based on the options set in this builder.
spark = SparkSession.builder \
    .master("local") \
    .appName("AmazonReviewPredictionTest_data") \
    .config("spark.executor.memory", "5gb") \
    .getOrCreate()
# Main entry point for Spark functionality. A SparkContext represents the
# connection to a Spark cluster, and can be used to create :class:`RDD` and
# broadcast variables on that cluster.
sc = spark.sparkContext
sqlContext = SQLContext(sc)
I have also increased spark.executor.memory from 1gb to 5gb via config("spark.executor.memory", "5gb").
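To confirm the setting actually took effect, I can print the configuration the session is running with; this is just a sanity check using the standard getConf() API:

# Sanity check: list every property of the running session to verify
# that spark.executor.memory really came through as 5gb.
for key, value in sc.getConf().getAll():
    print(key, "=", value)
# Or query the one property directly:
print(sc.getConf().get("spark.executor.memory"))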
Is there anything I could do to increase the speed?
The text file is 500 MB of Amazon reviews, and this already takes hours to execute.
But I also have a 12 GB text file; would that take weeks?
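For reference, the reviews are read in roughly like this (the path below is a placeholder, not my real one):

# Sketch of the loading step; /data/amazon_reviews.txt is an illustrative path.
reviews = spark.read.text("/data/amazon_reviews.txt")
print(reviews.count())  # forces a full pass over the 500 MB file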
I think I have 125 GB of RAM, as shown in the free -g output above.
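To double-check the total RAM from Python (Linux only, the same number free -g reports):

import os
# Total physical memory in GB: page size times number of physical pages.
total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / (1024 ** 3)
print(total_gb)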