I am learning how to use Spark, but there are things I still don't understand. I have the following code:
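import urllib.request
from time import time

# Download the 10% sample of the KDD Cup 1999 data set (gzip-compressed)
f = urllib.request.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")
data_file = "./kddcup.data_10_percent.gz"

# sc is assumed to be the SparkContext that the PySpark shell or notebook provides
raw_data = sc.textFile(data_file)

# filter() is a lazy transformation: nothing is computed yet
normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)
normal_raw_data

# count() is an action, so the job actually runs (and is timed) here
t0 = time()
normal_count = normal_raw_data.count()
tt = time() - t0
print("There are {} 'normal' interactions".format(normal_count))
print("Count completed in {} seconds".format(round(tt, 3)))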
I have already created my RDD, but Spark supposedly works in parallel across multiple nodes. Nowhere in the code do I see where I specify the number of nodes I want to split the work across, or the amount of memory I am going to use.
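For reference, here is a minimal sketch of where such settings usually go, assuming the SparkContext is created by hand rather than taken from the shell. The helper names `build_settings` and `create_context`, the application name, and the values `4` and `"2g"` are hypothetical and only for illustration. The property keys `spark.master`, `spark.executor.memory` and `spark.app.name` are standard Spark configuration properties.

```python
def build_settings(cores=4, executor_memory="2g"):
    """Hypothetical helper: collect Spark properties as key/value pairs."""
    return [
        # local[N] runs Spark on this machine with N worker threads;
        # on a real cluster this would be the cluster manager's URL instead.
        ("spark.master", "local[{}]".format(cores)),
        # Memory per executor; this mainly matters on a cluster, because in
        # local mode the work runs inside the driver process.
        ("spark.executor.memory", executor_memory),
        ("spark.app.name", "kdd-normal-count"),
    ]


def create_context(settings):
    """Build a SparkContext from the settings (requires pyspark)."""
    # Imported here so build_settings() can be used without pyspark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAll(settings)
    return SparkContext.getOrCreate(conf)
```

With this, `sc = create_context(build_settings(cores=4))` would stand in for the `sc` used above. Note that gzip files are not splittable, so `sc.textFile` reads this file as a single partition; `raw_data.getNumPartitions()` shows the count, and `raw_data.repartition(n)` is one way to spread later work over more partitions.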
I am timing the count because I want to see the difference between working with Spark (and its parallel system) and working as I normally do with Pandas DataFrames.
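To have a single-process number to compare against, a baseline along these lines could be timed the same way. This is only a sketch using the standard library instead of Pandas, and the function names `count_matching_lines` and `timed` are hypothetical. It applies the same test as the Spark filter: a line counts if it contains `'normal.'`.

```python
import gzip
from time import time


def count_matching_lines(path, needle="normal."):
    """Count the lines of a gzip-compressed text file that contain needle."""
    count = 0
    # "rt" opens the compressed file in text mode, one record per line
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if needle in line:
                count += 1
    return count


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    t0 = time()
    result = func(*args)
    return result, time() - t0
```

Calling `normal_count, tt = timed(count_matching_lines, "./kddcup.data_10_percent.gz")` gives a count and a time that can be set next to the Spark figures. On a file this small, a local Spark run may well not be faster, since starting and scheduling the job has its own cost.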
question from:
https://stackoverflow.com/questions/65900723/pysparck-how-to-notify-the-number-of-nodes-that-i-am-going-to-use