I have a PySpark DataFrame that comes from a view in BigQuery. I import it after configuring the Spark session:
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
config = pyspark.SparkConf().setAll([('spark.executor.memory', '10g'),('spark.driver.memory', '30G'),
('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0')])
sc = pyspark.SparkContext(conf=config)
spark = SparkSession.builder.master('yarn').appName('base_analitica_entidades').config(conf=config).getOrCreate()
I read this dataset through:
recomendaveis = spark.read.format("bigquery").option("viewsEnabled", "true").load("resource_group:some_group.someView")
Then I filter on a specific column with isNotNull:
recomendaveis_mid = recomendaveis.filter(recomendaveis["entities_mid"].isNotNull())
This recomendaveis_mid dataset is:
DataFrame[uid: string, revision: bigint, title: string, subtitle: string, access: string, branded_content: boolean, image: string, published_in: date, changed_in: date, entities_extracted_in: string, translation_extracted_in: string, categories_extracted_in: string, bigquery_inserted_in: string, public_url: string, private_url: string, text: string, translation_en: string, authors_name: string, categories_name: string, categories_confidence: double, entities_name: string, entities_type: string, entities_salience: double, entities_mid: string, entities_wikipedia_url: string, named_entities: string, publications: string, body: string, Editoria: string, idmateria: string]
When I try to get the minimum date of the published_in column with:
recomendaveis_mid.select(F.min("published_in")).collect()
It throws this error:
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table resource_group:some_group.table is invalid. Filter is '(`entities_mid` IS NOT NULL)'
at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
... 14 more
The published_in field has nothing to do with my filter on entities_mid, and when I run the date query without the entities_mid isNotNull filter, my code works fine. Any suggestions?
There is a similar error here, but I couldn't get any other ideas from it. Thanks in advance.