I have created a DataFrame, ordersDF. Below is its schema.
root
|-- order_id: long (nullable = true)
|-- order_date: string (nullable = true)
|-- order_customer_id: long (nullable = true)
|-- order_status: string (nullable = true)
In some places we use 'order_id', in others order_id or ordersDF.order_id. It is really confusing which one to use when.
For example:
1) ordersDF.select(order_id).show() -- NameError: name 'order_id' is not defined
   ordersDF.where('order_id==9').show() -- No error here
2) ordersDF.select('order_id').show() -- No error here
3) ordersDF.select(ordersDF.order_id).show() -- No error here
4) ordersDF.where('ordersDF.order_id==9').show() -- AnalysisException: cannot resolve '`ordersDF.order_id`' given input columns: [order_customer_id, order_date, order_id, order_status]; line 1 pos 0;
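As far as I can tell, the NameError in case 1 is raised by Python itself rather than by Spark: a bare order_id is looked up as a Python variable while the arguments are being evaluated, whereas 'order_id' is just a string that gets passed along. Below is a minimal sketch of that difference in plain Python. The function select_columns is a hypothetical stand-in invented for illustration, not a Spark API, and the sketch says nothing about how Spark resolves the string afterwards.

```python
# Hypothetical stand-in for a method that accepts column names.
# It is NOT part of Spark; it only echoes back what it receives.
def select_columns(*cols):
    return list(cols)

# A quoted name is an ordinary Python string, so the call goes through.
picked = select_columns('order_id')

# A bare name is resolved as a Python variable while the arguments are
# evaluated, so the call fails before select_columns is ever entered.
error_name = None
error_text = None
try:
    select_columns(order_id)
except NameError as exc:
    error_name = type(exc).__name__
    error_text = str(exc)

print(picked)
print(error_name, error_text)
```

This would suggest that case 1 fails for a different reason than case 4, where the string is accepted by Python and the exception comes from Spark when it analyses the expression.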
question from:
https://stackoverflow.com/questions/65914467/spark-dataframe-clarification-on-select