I'm trying to clean up some nested data and extract the fields I care about.
My schema for the nested value is:
|-- maritalstatus: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- text_: string (nullable = true)
| |-- text__extensions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- extension: array (nullable = true)
| | |-- element: string (containsNull = true)
I want to extract the text_
field as its own column.
I have tried:
df.select(col("maritalstatus.text_")).show()
and df.select(col("maritalstatus")["text_"]).show()
but it returns:
+-----+
|text_|
+-----+
| null|
| null|
...
| null|
+-----+
When I define a UDF as:
def getMaritalStatus(ms):
return ms.text_
gms = udf(getMaritalStatus, StringType())
and do df.select(gms(col("maritalstatus")).show()
it returns the data that i expect
Interestingly, I have another nested struct field that has a similar structure but with numbers as keys instead of names, and I am able to use the df.select(col("birthdate")["0"]).show()
notation
Birthdate's schema:
root
|-- birthdate: struct (nullable = true)
| |-- 0: date (nullable = true)
| |-- 1: integer (nullable = true)
Is there anyway to extract maritalstatus.text_
without using the UDF? I've heard UDF is not as performant as other approaches?
cassandra table structure:
CREATE TABLE keyspace.patient (
id text PRIMARY KEY,
active boolean,
active_extensions list<text>,
address list<frozen<address>>,
birthdate frozen<tuple<date, int>>,
birthdate_extensions list<text>,
communication list<frozen<patient_communication>>,
contact list<frozen<patient_contact>>,
contained list<frozen<tuple<text, text, text>>>,
deceasedboolean boolean,
deceasedboolean_extensions list<text>,
deceaseddatetime frozen<tuple<timestamp, text, int>>,
deceaseddatetime_extensions list<text>,
extension list<text>,
gender text,
gender_extensions list<text>,
generalpractitioner list<text>,
identifier list<frozen<identifier>>,
implicitrules text,
implicitrules_extensions list<text>,
language text,
language_extensions list<text>,
link list<frozen<patient_link>>,
managingorganization text,
maritalstatus frozen<codeableconcept>,
meta frozen<meta>,
modifierextension list<text>,
multiplebirthboolean boolean,
multiplebirthboolean_extensions list<text>,
multiplebirthinteger int,
multiplebirthinteger_extensions list<text>,
name list<frozen<humanname>>,
photo list<frozen<attachment>>,
telecom list<frozen<contactpoint>>,
text_ frozen<narrative>
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
and the codeableconcept
is
CREATE TYPE keyspace.codeableconcept (
extension list<text>,
text_ text,
text__extensions list<text>,
id text,
coding list<frozen<coding>>
);