A subquery's outer statement can be any one of:
SELECT
,INSERT
,UPDATE
,DELETE
,SET
, orDO
.Subquery optimization for
IN
is not as effective as for the=
operator or for theIN(
operator.value_list
)A typical case for poor
IN
subquery performance is when the subquery returns a small number of rows but the outer query returns a large number of rows to be compared to the subquery result.The problem is that, for a statement that uses an
IN
subquery, the optimizer rewrites it as a correlated subquery. Consider the following statement that uses an uncorrelated subquery:SELECT ... FROM t1 WHERE t1.a IN (SELECT b FROM t2);
The optimizer rewrites the statement to a correlated subquery:
SELECT ... FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a);
If the inner and outer queries return
M
andN
rows, respectively, the execution time becomes on the order of O(M
×N
), rather than O(M
+N
) as it would be for an uncorrelated subquery.An implication is that an
IN
subquery can be much slower than a query written using anIN(
operator that lists the same values that the subquery would return.value_list
)In general, you cannot modify a table and select from the same table in a subquery. For example, this limitation applies to statements of the following forms:
DELETE FROM t WHERE ... (SELECT ... FROM t ...); UPDATE t ... WHERE col = (SELECT ... FROM t ...); {INSERT|REPLACE} INTO t (SELECT ... FROM t ...);
Exception: The preceding prohibition does not apply if you are using a subquery for the modified table in the
FROM
clause. Example:UPDATE t ... WHERE col = (SELECT * FROM (SELECT ... FROM t...) AS _t ...);
Here the result from the subquery in the
FROM
clause is stored as a temporary table, so the relevant rows int
have already been selected by the time the update tot
takes place.Row comparison operations are only partially supported:
For
,expr
IN (subquery
)expr
can be ann
-tuple (specified using row constructor syntax) and the subquery can return rows ofn
-tuples.For
,expr
op
{ALL|ANY|SOME} (subquery
)expr
must be a scalar value and the subquery must be a column subquery; it cannot return multiple-column rows.
In other words, for a subquery that returns rows of
n
-tuples, this is supported:(
val_1
, ...,val_n
) IN (subquery
)But this is not supported:
(
val_1
, ...,val_n
)op
{ALL|ANY|SOME} (subquery
)The reason for supporting row comparisons for
IN
but not for the others is thatIN
is implemented by rewriting it as a sequence of=
comparisons andAND
operations. This approach cannot be used forALL
,ANY
, orSOME
.Subqueries in the
FROM
clause cannot be correlated subqueries. They are materialized (executed to produce a result set) before evaluating the outer query, so they cannot be evaluated per row of the outer query.MySQL does not support
LIMIT
in subqueries for certain subquery operators:mysql>
SELECT * FROM t1
->WHERE s1 IN (SELECT s2 FROM t2 ORDER BY s1 LIMIT 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'The optimizer is more mature for joins than for subqueries, so in many cases a statement that uses a subquery can be executed more efficiently if you rewrite it as a join.
An exception occurs for the case where an
IN
subquery can be rewritten as aSELECT DISTINCT
join. Example:SELECT col FROM t1 WHERE id_col IN (SELECT id_col2 FROM t2 WHERE
condition
);That statement can be rewritten as follows:
SELECT DISTINCT col FROM t1, t2 WHERE t1.id_col = t2.id_col AND
condition
;But in this case, the join requires an extra
DISTINCT
operation and is not more efficient than the subquery.MySQL permits a subquery to refer to a stored function that has data-modifying side effects such as inserting rows into a table. For example, if
f()
inserts rows, the following query can modify data:SELECT ... WHERE x IN (SELECT f() ...);
This behavior is nonstandard (not permitted by the SQL standard). In MySQL, it can produce indeterminate results because
f()
might be executed a different number of times for different executions of a given query depending on how the optimizer chooses to handle it.For statement-based or mixed-format replication, one implication of this indeterminism is that such a query can produce different results on the master and its slaves.
Possible future optimization: MySQL does not rewrite the join order for subquery evaluation. In some cases, a subquery could be executed more efficiently if MySQL rewrote it as a join. This would give the optimizer a chance to choose between more execution plans. For example, it could decide whether to read one table or the other first.
Example:
SELECT a FROM outer_table AS ot WHERE a IN (SELECT a FROM inner_table AS it WHERE ot.b = it.b);
For that query, MySQL always scans
outer_table
first and then executes the subquery oninner_table
for each row. Ifouter_table
has a lot of rows andinner_table
has few rows, the query probably will not be as fast as it could be.The preceding query could be rewritten like this:
SELECT a FROM outer_table AS ot, inner_table AS it WHERE ot.a = it.a AND ot.b = it.b;
In this case, we can scan the small table (
inner_table
) and look up rows inouter_table
, which will be fast if there is an index on(ot.a,ot.b)
.Possible future optimization: A correlated subquery is evaluated for each row of the outer query. A better approach is that if the outer row values do not change from the previous row, do not evaluate the subquery again. Instead, use its previous result.
Possible future optimization: A subquery in the
FROM
clause is evaluated by materializing the result into a temporary table, and this table does not use indexes. This does not allow the use of indexes in comparison with other tables in the query, although that might be useful.Possible future optimization: If a subquery in the
FROM
clause resembles a view to which the merge algorithm can be applied, rewrite the query and apply the merge algorithm so that indexes can be used. The following statement contains such a subquery:SELECT * FROM (SELECT * FROM t1 WHERE t1.t1_col) AS _t1, t2 WHERE t2.t2_col;
The statement can be rewritten as a join like this:
SELECT * FROM t1, t2 WHERE t1.t1_col AND t2.t2_col;
This type of rewriting would provide two benefits:
It avoids the use of a temporary table for which no indexes can be used. In the rewritten query, the optimizer can use indexes on
t1
.It gives the optimizer more freedom to choose between different execution plans. For example, rewriting the query as a join enables the optimizer to use
t1
ort2
first.
Possible future optimization: For
IN
,= ANY
,<> ANY
,= ALL
, and<> ALL
with uncorrelated subqueries, use an in-memory hash for a result or a temporary table with an index for larger results. Example:SELECT a FROM big_table AS bt WHERE non_key_field IN (SELECT non_key_field FROM
table
WHEREcondition
)In this case, we could create a temporary table:
CREATE TABLE t (key (non_key_field)) (SELECT non_key_field FROM
table
WHEREcondition
)Then, for each row in
big_table
, do a key lookup int
based onbt.non_key_field
.