If you absolutely need the timestamp to be formatted exactly as indicated, namely, with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.
However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it's possible to do this without a UDF, which may perform significantly better for you depending on the size of your dataset.
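To make the three timezone representations concrete, here is a small plain-Python sketch (not Spark code) that renders the same UTC instant each way. The variable names are illustrative only.

```python
from datetime import datetime, timezone

# One fixed instant in UTC, used purely for illustration
instant = datetime(2017, 8, 1, 9, 0, 0, tzinfo=timezone.utc)

# "+00:00": offset with a colon separator
with_colon = instant.isoformat()

# "+0000": offset without a colon separator
without_colon = instant.strftime("%Y-%m-%dT%H:%M:%S%z")

# "Z": the literal UTC designator (hard-coded here, valid only because instant is UTC)
zulu = instant.strftime("%Y-%m-%dT%H:%M:%SZ")

print(with_colon, without_colon, zulu, sep="\n")
```

All three strings denote the same moment; only the spelling of the zero offset differs.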
Given the following representation of data
+---+-------------------------+
|id |timestamp_value          |
+---+-------------------------+
|1  |2017-08-01T14:30:00+05:30|
|2  |2017-08-01T14:30:00+06:30|
|3  |2017-08-01T14:30:00+07:30|
+---+-------------------------+
as given by:
l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])
where timestamp_value is a String, you could do the following (this uses to_timestamp and session local timezone support, both of which were introduced in Spark 2.2):
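from pyspark.sql.functions import to_timestamp, date_format

# Render timestamps in UTC instead of the default session timezone
spark.conf.set('spark.sql.session.timeZone', 'UTC')

op_df = ip_df.select(
    date_format(
        # Parse the string, including its "+05:30"-style offset
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        # "Z" writes the offset without a colon, e.g. "+0000"
        "yyyy-MM-dd'T'HH:mm:ssZ"
    ).alias('timestamp_value'))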
which yields:
+------------------------+
|timestamp_value         |
+------------------------+
|2017-08-01T09:00:00+0000|
|2017-08-01T08:00:00+0000|
|2017-08-01T07:00:00+0000|
+------------------------+
or, slightly differently:
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        # "XXX" writes a zero offset as "Z"
        "yyyy-MM-dd'T'HH:mm:ssXXX"
    ).alias('timestamp_value'))
which yields:
+--------------------+
|timestamp_value     |
+--------------------+
|2017-08-01T09:00:00Z|
|2017-08-01T08:00:00Z|
|2017-08-01T07:00:00Z|
+--------------------+
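If the exact "+00:00" form is required, the UDF route mentioned at the top could be built around a plain Python function like the one below. This is only a sketch: the name to_utc_iso is hypothetical, and it assumes every input is an ISO 8601 string carrying an explicit offset (a string without an offset would be interpreted in the machine's local timezone).

```python
from datetime import datetime, timezone


def to_utc_iso(ts):
    """Convert an ISO 8601 string with an offset to UTC, written with "+00:00".

    Hypothetical helper, not part of Spark. Returns None for null input.
    """
    if ts is None:
        return None
    # fromisoformat understands offsets such as "+05:30" (Python 3.7+)
    parsed = datetime.fromisoformat(ts)
    # Shift to UTC; isoformat() then writes the offset as "+00:00"
    return parsed.astimezone(timezone.utc).isoformat()


# The same sample values as in the DataFrame above
samples = [
    '2017-08-01T14:30:00+05:30',
    '2017-08-01T14:30:00+06:30',
    '2017-08-01T14:30:00+07:30',
]
converted = [to_utc_iso(ts) for ts in samples]
print(converted)
```

In Spark, such a function could be wrapped with udf(to_utc_iso, StringType()), using udf from pyspark.sql.functions and StringType from pyspark.sql.types, and then applied to ip_df.timestamp_value. Expect it to be slower than the built-in functions, since each row passes through the Python interpreter.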