Correctness issue for queries using SELECT DISTINCT
06 Feb 2024
This page documents the preview (v2.21) version. Preview includes features under active development and is for development and testing only. For production, use the stable (v2024.1) version.
Description
The impacted releases contain a correctness issue for queries using the SELECT DISTINCT clause. The queries may return incorrect results when the following conditions are true.
- BatchNestedLoopJoin is enabled
- JOIN clause is present in the query
- A DISTINCT clause is present in the query
- The table has range indexes
The issue does not apply to hash or merge joins.
Mitigation
Upgrade to a release with the fix.
Details
- The issue is based on how the query planner handles the optimization of DISTINCT operations in the presence of join conditions involving columns from the inner relation.
- The issue concerns a query like
SELECT DISTINCT t2.k FROM t1 JOIN t2 ON t1.k = t2.k; where a distinct index scan is chosen on table t2 (inner relation). - Previously, the planner erroneously treated
t2.kas a constant during the distinct index scan because the conditiont2.k = <some terms not related to t2>made it appear constant. However, this assumption is incorrect in batch nested loop joins because a single scan on t2 can return multiple values fort2.kmatching different values fromt1.k. - The correct behavior is to include
t2.kin the distinct prefix (the columns used to eliminate duplicates) unless the condition is of the formt2.k = C, whereCis a true constant value. - In the case of a nested loop join with
foreach $1 in t1.k do; distinct index scan on t2 where t2.k = $1, it is correct to treatt2.kas a constant and exclude it from the distinct prefix. But for a batch nested loop join withforeach ($1, ..., $1024) in t1.k do; distinct index scan on t2 where t2.k in ($1, ..., $1024), it is incorrect to treatt2.kas a constant because the scan can return multiple values fort2.k. - The issue does not apply to hash or merge joins.