Why does using UDFs lead to a Cartesian product instead of a full outer join?
The reason a UDF requires a Cartesian product is quite simple. Since you pass an arbitrary function with a possibly infinite domain and non-deterministic behavior, the only way to determine its value is to pass the arguments and evaluate it. That means you simply have to check all possible pairs.
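To make the "check all possible pairs" point concrete, here is a minimal plain-Python sketch. It is not Spark code: `join_on_predicate`, `opaque_udf`, and the sample rows are hypothetical names invented for illustration. It only shows that a join driven by an opaque function has to evaluate that function once per pair.

```python
from itertools import product


def join_on_predicate(left, right, predicate):
    """Inner join on an opaque predicate: every (l, r) pair is evaluated."""
    calls = 0
    matches = []
    for l, r in product(left, right):
        calls += 1
        if predicate(l, r):
            matches.append((l, r))
    return matches, calls


# Hypothetical sample data standing in for tables t1 and t2.
t1 = [{"foo": 1}, {"foo": 2}, {"foo": 3}]
t2 = [{"bar": 2}, {"bar": 3}, {"bar": 4}]


def opaque_udf(l, r):
    # Stand-in for a UDF. It happens to test equality, but a planner
    # that cannot look inside the function has no way to know that.
    return l["foo"] == r["bar"]


matches, calls = join_on_predicate(t1, t2, opaque_udf)
print(calls)    # number of predicate evaluations: len(t1) * len(t2)
print(matches)
```

Even though only two pairs match here, the predicate is called for all nine combinations, because nothing about the function tells the caller which pairs can be skipped.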
Simple equality, on the other hand, has predictable behavior. If you use a t1.foo = t2.bar condition, you can simply shuffle the t1 and t2 rows by foo and bar respectively to get the expected result.
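The shuffle idea can be sketched in plain Python as well. This is an illustration of hash partitioning under simplifying assumptions, not how Spark implements it: `shuffle`, `equi_join`, the partition count, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def shuffle(rows, key, num_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def equi_join(left, right, left_key, right_key, num_partitions=4):
    """Inner equi-join that only compares rows sharing a partition."""
    left_parts = shuffle(left, left_key, num_partitions)
    right_parts = shuffle(right, right_key, num_partitions)
    matches = []
    comparisons = 0
    for p, left_rows in left_parts.items():
        for l in left_rows:
            # Equal keys hash to the same partition, so rows in other
            # partitions can never match and are not inspected at all.
            for r in right_parts.get(p, []):
                comparisons += 1
                if l[left_key] == r[right_key]:
                    matches.append((l, r))
    return matches, comparisons


# Hypothetical sample data standing in for tables t1 and t2.
t1 = [{"foo": 1}, {"foo": 2}, {"foo": 3}]
t2 = [{"bar": 2}, {"bar": 3}, {"bar": 4}]

matches, comparisons = equi_join(t1, t2, "foo", "bar")
print(comparisons, matches)
```

Because equal keys always land in the same partition, the join compares far fewer pairs than the full cross product. This shortcut is only valid because the condition is known to be an equality.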
And just to be precise: in relational algebra, an outer join is actually expressed using a natural join. Anything beyond that is simply an optimization.
Any way to force an outer join over the Cartesian product?
Not really, unless you want to modify the Spark SQL engine.
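As a rough illustration of that definition, the sketch below builds a full outer join from an inner join plus the unmatched rows of each side padded with None. It uses an equi-join on explicitly named keys in place of a true natural join, and `inner_join`, `full_outer_join`, and the sample rows are hypothetical.

```python
def inner_join(left, right, left_key, right_key):
    """Naive inner equi-join, standing in for the natural join."""
    return [(l, r) for l in left for r in right if l[left_key] == r[right_key]]


def full_outer_join(left, right, left_key, right_key):
    """Inner join, plus unmatched rows from each side padded with None."""
    matched = inner_join(left, right, left_key, right_key)
    matched_left = [l for l, _ in matched]
    matched_right = [r for _, r in matched]
    left_only = [(l, None) for l in left if l not in matched_left]
    right_only = [(None, r) for r in right if r not in matched_right]
    return matched + left_only + right_only


# Hypothetical sample data standing in for tables t1 and t2.
t1 = [{"foo": 1}, {"foo": 2}, {"foo": 3}]
t2 = [{"bar": 2}, {"bar": 3}, {"bar": 4}]

result = full_outer_join(t1, t2, "foo", "bar")
print(result)
```

The outer join adds nothing new to the matching step itself. Whatever strategy finds the matching pairs also determines which rows are left over to be padded.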