I define the following code in order to load a pretrained embedding model:
import gensim
from gensim.models.fasttext import FastText as FT_gensim
import numpy as np

class Loader(object):
    cache = {}
    emb_dic = {}
    count = 0

    def __init__(self, filename):
        print("|-------------------------------------|")
        print("Welcome to Loader class in python")
        print("|-------------------------------------|")
        self.fn = filename

    @property
    def fasttext(self):
        if Loader.count == 1:
            print("already loaded")
        if self.fn not in Loader.cache:
            Loader.cache[self.fn] = FT_gensim.load_fasttext_format(self.fn)
            Loader.count = Loader.count + 1
        return Loader.cache[self.fn]

    def map(self, word):
        if word not in self.fasttext:
            Loader.emb_dic[word] = np.random.uniform(low = 0.0, high = 1.0, size = 300)
            return Loader.emb_dic[word]
        return self.fasttext[word]
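The caching behaviour of the fasttext property can be checked without Spark or gensim. The following is a minimal sketch, not the real loader: CachedLoader and fake_load are hypothetical stand-ins for Loader and FT_gensim.load_fasttext_format, and the "model" is just a dict. It suggests that, within one Python process, the file is loaded once per distinct filename no matter how many instances are created.

```python
class CachedLoader(object):
    """Hypothetical stand-in for Loader, keeping only the class-level cache."""
    cache = {}        # shared by every instance in this Python process
    load_calls = 0    # counts how often the expensive load actually runs

    def __init__(self, filename):
        self.fn = filename

    @staticmethod
    def fake_load(filename):
        # Assumption: replaces FT_gensim.load_fasttext_format for illustration.
        CachedLoader.load_calls += 1
        return {"source": filename}

    @property
    def model(self):
        # Load only if this filename has not been loaded in this process yet.
        if self.fn not in CachedLoader.cache:
            CachedLoader.cache[self.fn] = CachedLoader.fake_load(self.fn)
        return CachedLoader.cache[self.fn]


# Many instances, same filename: the load runs once.
models = [CachedLoader("model.bin").model for _ in range(5)]
first_is_shared = all(m is models[0] for m in models)
calls_after_same_file = CachedLoader.load_calls

# A different filename triggers a second load.
CachedLoader("other.bin").model
calls_after_second_file = CachedLoader.load_calls
```

Class attributes live in the memory of a single interpreter, so under PySpark this cache would presumably exist once per Python worker process rather than once per partition. How many worker processes an executor starts depends on the cluster configuration.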
I call this class like this:
inputRaw = sc.textFile(inputFile, 3).map(lambda line: (line.split("")[0], line.split("")[1])).map(Loader(modelpath).map)
I'm confused about how many times the modelpath file will be loaded. I want it to be loaded once per executor and shared by all of that executor's cores. My guess is that modelpath will be loaded 3 times (the number of partitions). If that is right, the drawback of this design is tied to the size of the modelpath file. Suppose the file is 10 GB and I have 200 partitions: we would then need 10 * 200 GB = 2000 GB, which is huge (so this solution only works with a small number of partitions).
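The worst case described above is plain multiplication. A small sketch, assuming every load keeps a full copy of the model resident at the same time; total_model_memory_gb is a hypothetical helper, and the executor count of 3 is the spark.executor.instances value mentioned later in the question:

```python
def total_model_memory_gb(model_size_gb, copies):
    """Memory needed if `copies` full copies of the model are held at once."""
    return model_size_gb * copies


# Figures from the question: a 10 GB model file.
per_partition_gb = total_model_memory_gb(10, 200)  # one copy per partition
per_executor_gb = total_model_memory_gb(10, 3)     # one copy per executor
```

The gap between the two figures is why the number of loads matters more than the number of partitions.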
Suppose I have an
rdd = (id, sentence) = [(id1, u'patina californian'), (id2, u'virgil american'), (id3, u'frensh'), (id4, u'american')]
and I want to sum up the word embedding vectors for each sentence:
def test(document):
    print("document is = {}".format(document))
    documentWords = document.split(" ")
    features = np.zeros(300)
    for word in documentWords:
        features = np.add(features, Loader(modelpath).fasttext[word])
    return features

def calltest(inputRawSource):
    my_rdd = inputRawSource.map(lambda line: (line[0], test(line[1]))).cache()
    return my_rdd
In this case, how many times will the modelpath file be loaded? Note that I set "spark.executor.instances" to 3.
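One way to reason about the load count is to make the loading step explicit per partition. The sketch below is an illustration under stated assumptions, not a tested Spark job: load_model is a hypothetical stand-in that returns toy 3-dimensional vectors as plain lists (the real model would return 300-dimensional arrays), unknown words contribute a zero vector in this toy, and embed_partition is a hypothetical function with the shape that RDD.mapPartitions expects (an iterator in, an iterator out).

```python
LOAD_CALLS = []          # records each (simulated) model load
_MODEL_CACHE = {}        # per-process cache, mirroring Loader.cache


def load_model(path):
    # Hypothetical stand-in for the real FastText load; toy 3-d vectors.
    LOAD_CALLS.append(path)
    return {"patina": [1.0, 0.0, 0.0],
            "californian": [0.0, 1.0, 0.0],
            "american": [0.0, 0.0, 1.0]}


def get_model(path):
    # Load at most once per Python process.
    if path not in _MODEL_CACHE:
        _MODEL_CACHE[path] = load_model(path)
    return _MODEL_CACHE[path]


def embed_partition(rows, path="model.bin", dim=3):
    """Sum word vectors per sentence for one partition of (id, sentence) rows."""
    model = get_model(path)          # fetched once per partition, not per word
    for doc_id, sentence in rows:
        total = [0.0] * dim
        for word in sentence.split(" "):
            # Assumption: unknown words add a zero vector in this toy example.
            vector = model.get(word, [0.0] * dim)
            total = [a + b for a, b in zip(total, vector)]
        yield (doc_id, total)


# Two simulated partitions processed by the same Python process.
part1 = list(embed_partition([("id1", "patina californian")]))
part2 = list(embed_partition([("id4", "american"), ("id3", "frensh")]))
```

With Spark, such a function would be passed as inputRawSource.mapPartitions(embed_partition); that call is not exercised here. In this simulation both partitions share one load because they run in the same process.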