Since Spark 2.4 you can use the slice function. In Python:
pyspark.sql.functions.slice(x, start, length)
Collection function: returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
...
New in version 2.4.
from pyspark.sql.functions import slice
df = spark.createDataFrame([
(10, "Finance", ["Jon", "Snow", "Castle", "Black", "Ned"]),
(20, "IT", ["Ned", "is", "no", "more"])
], ("dept_id", "dept_nm", "emp_details"))
df.select(slice("emp_details", 1, 3).alias("emp_details")).show()
+-------------------+
|        emp_details|
+-------------------+
|[Jon, Snow, Castle]|
|      [Ned, is, no]|
+-------------------+
In Scala:
def slice(x: Column, start: Int, length: Int): Column
Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
import org.apache.spark.sql.functions.slice
import spark.implicits._  // for toDF and $; available automatically in spark-shell

val df = Seq(
(10, "Finance", Seq("Jon", "Snow", "Castle", "Black", "Ned")),
(20, "IT", Seq("Ned", "is", "no", "more"))
).toDF("dept_id", "dept_nm", "emp_details")
df.select(slice($"emp_details", 1, 3) as "emp_details").show
+-------------------+
|        emp_details|
+-------------------+
|[Jon, Snow, Castle]|
|      [Ned, is, no]|
+-------------------+
The same thing can, of course, be done in SQL:
SELECT slice(emp_details, 1, 3) AS emp_details FROM df
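For instance, a minimal sketch of running that query, assuming the usual spark session and the DataFrame from the Scala example registered as a temporary view named df:

// register the DataFrame under the name used in the FROM clause
df.createOrReplaceTempView("df")

spark.sql("SELECT slice(emp_details, 1, 3) AS emp_details FROM df").show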
Important:
Please note that, unlike Seq.slice, values are indexed from one and the second argument is the length, not the end position.
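To make the difference concrete, here is a quick sketch (reusing df from the Scala example above) contrasting Seq.slice with the SQL function, including a negative start, which counts from the end of the array:

import org.apache.spark.sql.functions.slice

// Scala collections: slice(from, until) is zero-based and end-exclusive
Seq("Jon", "Snow", "Castle", "Black", "Ned").slice(1, 3)  // List(Snow, Castle)

// Spark SQL: slice(column, start, length) uses a one-based start
df.select(slice($"emp_details", 1, 3)).show
// first row: [Jon, Snow, Castle]

// a negative start selects from the end of the array
df.select(slice($"emp_details", -2, 2)).show
// first row: [Black, Ned]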