
python - Pass dataframe column value to function

My situation is that I'm receiving transaction data from a vendor with a datetime that is in local time, but the offset it carries is a meaningless +00:00 rather than the store's actual offset. For example, the ModifiedTime column may have a value of

'2020-05-16T15:04:55.7429192+00:00'
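
Note that dateutil will happily parse that string, but the tzinfo it attaches is the feed's meaningless +00:00, not the store's zone. A quick illustration:

import dateutil.parser as parser

ts = parser.parse('2020-05-16T15:04:55.7429192+00:00')
print(ts)         # 2020-05-16 15:04:55.742919+00:00
print(ts.tzinfo)  # tzutc() -- the feed's +00:00, not the store's zone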

I can get the local timezone by pulling together some other data about the store in which the transaction occurs:

timezone_local = tz.timezone(tzDf[0]["COUNTRY"] + '/' + tzDf[0]["TIMEZONE"])

I then wrote a function to take those two values and give it the proper timezone:

from datetime import datetime
import dateutil.parser as parser
import pytz as tz

def convert_naive_to_aware(datetime_local_str, timezone_local):
    # Parse once instead of re-parsing the string for every component
    parsed = parser.parse(datetime_local_str)
    yy = parsed.year
    mo = parsed.month  # distinct name -- reusing `mm` for the minute below would clobber the month
    dd = parsed.day
    hh = parsed.hour
    mi = parsed.minute
    ss = parsed.second  # microseconds are deliberately dropped

    # pytz timezones must be attached with localize(); passing one straight to the
    # datetime constructor can silently pick a historically wrong UTC offset
    aware = timezone_local.localize(datetime(yy, mo, dd, hh, mi, ss))

    return aware
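
For example, testing with a hypothetical store zone (America/New_York is just an illustration):

eastern = tz.timezone('America/New_York')  # hypothetical store timezone
print(convert_naive_to_aware('2020-05-16T15:04:55.7429192+00:00', eastern))
# 2020-05-16 15:04:55-04:00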

It works fine when I send it the timestamp as a string in testing, but it balks when I try to apply it to a dataframe, I presume because I don't yet know the right way to pass the column value as a string. In this case, I'm trying to replace the current ModifiedTime value with the result of the call to the function.

from pyspark.sql import functions as F
...
ordersDf = ordersDf.withColumn(
    "ModifiedTime",
    convert_naive_to_aware(F.substring(ordersDf.ModifiedTime, 1, 19), timezone_local)
)

Those of you more knowledgeable than I won't be surprised that I received the following error:

TypeError: 'Column' object is not callable

I admit I'm a bit of a tyro at Python and dataframes, and I may well be taking the long way 'round. I've attempted a few other things, such as ordersDf.ModifiedTime.cast("String"), but no luck. I'd be grateful for any suggestions.

We're using Azure Databricks; the cluster is Scala 2.11.

Question from: https://stackoverflow.com/questions/65876730/pass-dataframe-column-value-to-function


1 Answer


You need to convert the function into a UDF before you can apply it to a Spark dataframe. As written, the plain Python function runs immediately on the driver with a Column object as its argument; presumably dateutil then treats the Column as a file-like object and tries to call its read attribute, hence the 'Column' object is not callable error. A UDF instead defers the call until Spark can hand it each row's actual value:

from pyspark.sql import functions as F
import pytz as tz

# I assume `tzDf` is a pandas dataframe... this syntax wouldn't work with Spark.
timezone_local = tz.timezone(tzDf[0]["COUNTRY"] + '/' + tzDf[0]["TIMEZONE"])

# Convert the function to a UDF. A Column can't carry a pytz object, so the
# timezone travels as its string name and is rebuilt inside the wrapper; the
# result is stringified because F.udf defaults to returning StringType.
time_udf = F.udf(lambda ts, tz_name: str(convert_naive_to_aware(ts, tz.timezone(tz_name))))

# Avoid overwriting dataframe variables. Here I appended `2` to the new variable name.
ordersDf2 = ordersDf.withColumn(
    "ModifiedTime",
    time_udf(F.substring(ordersDf.ModifiedTime, 1, 19), F.lit(str(timezone_local)))
)
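
As an aside, if the end goal is a correct UTC timestamp, Spark's built-ins can do this without a UDF. A sketch, assuming Spark 2.4+ (to_utc_timestamp interprets its input as wall-clock time in the given zone and converts it to UTC; ModifiedTimeUtc is a hypothetical new column name):

ordersDf3 = ordersDf.withColumn(
    "ModifiedTimeUtc",
    F.to_utc_timestamp(
        F.to_timestamp(F.substring("ModifiedTime", 1, 19), "yyyy-MM-dd'T'HH:mm:ss"),
        str(timezone_local)  # zone name, e.g. 'America/New_York'
    )
)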
