关于朋友随手抛出的一段 SQL,发现 MySQL 中关于 max()和 group by 联合使用中的一个坑,特此整理。

YH:老铁们,这段 hql 对不对啊

我扫了一眼,总觉得看着别扭,自己试着去掉字符串拼接,还原出 SQL 来看,依然感觉不对,然后自己试着写了查询,在本地建个表,造了些数据,用简化后的 SQL 做测试时, 当我定睛检查以下这句 SQL

select predictId, max(evaluateDate) evalDate, productId from productcashpredict group by productId;

抛出一个疑问,MySQL 是从后往前执行,先分组再求分组结果中 evaluateDate 最大的记录呢?还是先找出 evaluateDate 的最大记录,再分组呢?

网上查了查,发现,都不是!这里有个坑!如果直接这么结合 max 和 group by 使用,查出的结果,除了求 max 的字段和分组条件 productId 字段,其他字段的值都是错的!

首先我在本地验证了一下是不是的确如此

desc productcashpredict;

predictId	int(11)	NO	PRI		auto_increment
evaluateDate	datetime	YES			on update CURRENT_TIMESTAMP
other	varchar(255)	YES
productId	int(11)	YES

select * from productcashpredict

+-----------+---------------------+-------+-----------+
| predictId | evaluateDate        | other | productId |
+-----------+---------------------+-------+-----------+
|         1 | 2017-10-31 18:14:37 | NULL  |     10001 |
|         2 | 2017-10-31 18:14:45 | NULL  |     10002 |
|         3 | 2017-10-31 18:14:59 | NULL  |     10002 |
|         4 | 2017-10-31 18:15:09 | NULL  |     10003 |
|         5 | 2017-10-31 18:15:22 | NULL  |     10001 |
+-----------+---------------------+-------+-----------+


select predictId, max(evaluateDate) evalDate, productId from productcashpredict group by productId;

+-----------+---------------------+-----------+
| predictId | evalDate            | productId |
+-----------+---------------------+-----------+
|         1 | 2017-10-31 18:15:22 |     10001 |
|         2 | 2017-10-31 18:14:59 |     10002 |
|         4 | 2017-10-31 18:15:09 |     10003 |
+-----------+---------------------+-----------+

直接这样查的确是错的,看 predictId 可以看出

我们可以看出来,MySQL 其实是把以下两种查询揉在了一起,结果的确是按照 productId 进行分组的,查询结果也是从取自分组后,排序则是默认按照主键排序,但 evalDate 也的确取的是每一个 productId 组内最大的那一项,但这样揉起来,数据就不对了。

select * from productcashpredict group by productId;

1	2017-10-31 18:14:37		10001
2	2017-10-31 18:14:45		10002
4	2017-10-31 18:15:09		10003

select max(evaluateDate) evalDate, productId from productcashpredict;

2017-10-31 18:15:22	10001

既然这样,那总得有解决方案,我又问了抛问者 YH,他们实际的使用场景是怎么查的,他甩出一张图。 我看了半天,很是理解不了,至少觉得逻辑不甚清晰,我开始质疑这样查询的结果是否正确,于是又去寻求 max 和 group by 联合使用的正确写法,得到下面两种方案:

于是我又建了一张表,来模拟真实场景

desc product_cash_predict;

id	int(11)	NO	PRI		auto_increment
add_time	datetime	YES			on update CURRENT_TIMESTAMP
deleted	varchar(255)	YES
update_time	datetime	YES			on update CURRENT_TIMESTAMP
version	varchar(255)	YES
evaluated_date	datetime	YES			on update CURRENT_TIMESTAMP
other	varchar(255)	YES
product_id	int(11)	YES

SELECT * FROM `product_cash_predict`;

+----+----------+---------+-------------+---------+---------------------+-------+------------+
| id | add_time | deleted | update_time | version | evaluated_date      | other | product_id |
+----+----------+---------+-------------+---------+---------------------+-------+------------+
|  1 | NULL     | NULL    | NULL        | NULL    | 2017-10-31 18:14:37 | NULL  |      10001 |
|  2 | NULL     | NULL    | NULL        | NULL    | 2017-10-31 18:14:45 | NULL  |      10002 |
|  3 | NULL     | NULL    | NULL        | NULL    | 2017-10-31 18:14:59 | NULL  |      10002 |
|  4 | NULL     | NULL    | NULL        | NULL    | 2017-10-31 18:15:09 | NULL  |      10003 |
|  5 | NULL     | NULL    | NULL        | NULL    | 2017-10-31 18:15:22 | NULL  |      10001 |
+----+----------+---------+-------------+---------+---------------------+-------+------------+

先按 YH 的业务场景查询方式查一下:

SELECT
	p2.*
FROM
	(
		SELECT
			max(evaluated_date) evaluated_date,
			product_id
		FROM
			product_cash_predict
		GROUP BY
			product_id
	) p1
LEFT JOIN product_cash_predict p2 ON p1.product_id = p2.product_id
AND p1.evaluated_date = p2.evaluated_date

-- 0.034s
--
+----+----------+---------+-------------+---------+---------------------+-------+------------+
| id | add_time | deleted | update_time | version | evaluated_date      | other | product_id |
+----+----------+---------+-------------+---------+---------------------+-------+------------+
|  3 | NULL     | NULL    | NULL        | NULL    | 2017-10-31 18:14:59 | NULL  |      10002 |
|  4 | NULL     | NULL    | NULL        | NULL    | 2017-10-31 18:15:09 | NULL  |      10003 |
|  5 | NULL     | NULL    | NULL        | NULL    | 2017-10-31 18:15:22 | NULL  |      10001 |
+----+----------+---------+-------------+---------+---------------------+-------+------------+

那么这个结果对不对呢

先来看一下错误的查询方式:

select id, add_time, deleted, update_time, version,
max(evaluated_date) evalDate, other, product_id
from product_cash_predict group by product_id;


1					2017-10-31 18:15:22		10001
2					2017-10-31 18:14:59		10002
4					2017-10-31 18:15:09		10003

由于数据量少,上面查询的错误肉眼可以用识别


-- 解决方案一:先排序再分组

select * from product_cash_predict ORDER BY evaluated_date desc

5					2017-10-31 18:15:22		10001
4					2017-10-31 18:15:09		10003
3					2017-10-31 18:14:59		10002
2					2017-10-31 18:14:45		10002
1					2017-10-31 18:14:37		10001

select * from (select * from product_cash_predict ORDER BY evaluated_date desc) group by product_id;
--  Every derived table must have its own alias(衍生表需要有其自己的别名)

select * from (select * from product_cash_predict ORDER BY evaluated_date desc) as result group by product_id;

-- 正确结果
5					2017-10-31 18:15:22		10001
3					2017-10-31 18:14:59		10002
4					2017-10-31 18:15:09		10003
-- 0.025s
-- 0.025s
-- 0.027s
-- 解决方案二:看上去很费解
select * from product_cash_predict p where p.evaluated_date
=
(select max(evaluated_date) from product_cash_predict where p.product_id = product_id)
group by product_id;

5					2017-10-31 18:15:22		10001
3					2017-10-31 18:14:59		10002
4					2017-10-31 18:15:09		10003

-- 0.028s
-- 0.019s
-- 0.025s

最后,也就是 YH 实际业务场景的处理方式再来回顾分析一波

-- 解决方案三:自连接,根据max结合group by查出最大日期和分组条件product_id,再自连接查出该product_id对应的其他字段

-- 只查最大日期和分组条件product_id,若查其他字段则为不准的数据(group by取分组第一条)
select max(evaluated_date) evalDate, product_id from product_cash_predict GROUP BY product_id

2017-10-31 18:15:22	10001
2017-10-31 18:14:59	10002
2017-10-31 18:15:09	10003

-- 把查出的结果作为p1集
select p1.* from (
	select max(evaluated_date) evalDate, product_id from product_cash_predict GROUP BY product_id
) p1;

-- 然后从根据此结果自连接后查出的结果中取真是的记录值
select p2.* from (
	select max(evaluated_date) evalDate, product_id from product_cash_predict GROUP BY product_id
) p1 LEFT JOIN product_cash_predict p2 on p1.product_id = p2.product_id and p1.evalDate = p2.evaluated_date;

3					2017-10-31 18:14:59		10002
4					2017-10-31 18:15:09		10003
5					2017-10-31 18:15:22		10001

性能比较

explain select * from (select * from product_cash_predict ORDER BY evaluated_date desc) as result group by product_id;

+----+-------------+----------------------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table                | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+----------------------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived2>           | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using temporary; Using filesort |
|  2 | DERIVED     | product_cash_predict | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using filesort                  |
+----+-------------+----------------------+------+---------------+------+---------+------+------+---------------------------------+

explain select * from product_cash_predict p where p.evaluated_date =
(select max(evaluated_date) from product_cash_predict where p.product_id = product_id)
group by product_id;

+----+--------------------+----------------------+------+---------------+------+---------+------+------+----------------------------------------------+
| id | select_type        | table                | type | possible_keys | key  | key_len | ref  | rows | Extra                                        |
+----+--------------------+----------------------+------+---------------+------+---------+------+------+----------------------------------------------+
|  1 | PRIMARY            | p                    | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | product_cash_predict | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using where                                  |
+----+--------------------+----------------------+------+---------------+------+---------+------+------+----------------------------------------------+

explain select p2.* from (
	select max(evaluated_date) evalDate, product_id from product_cash_predict GROUP BY product_id
) p1 LEFT JOIN product_cash_predict p2 on p1.product_id = p2.product_id and p1.evalDate = p2.evaluated_date;

+----+-------------+----------------------+------+---------------+------+---------+------+------+----------------------------------------------------+
| id | select_type | table                | type | possible_keys | key  | key_len | ref  | rows | Extra                                              |
+----+-------------+----------------------+------+---------------+------+---------+------+------+----------------------------------------------------+
|  1 | PRIMARY     | <derived2>           | ALL  | NULL          | NULL | NULL    | NULL |    5 | NULL                                               |
|  1 | PRIMARY     | p2                   | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using where; Using join buffer (Block Nested Loop) |
|  2 | DERIVED     | product_cash_predict | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using temporary; Using filesort                    |
+----+-------------+----------------------+------+---------------+------+---------+------+------+----------------------------------------------------+

END

致谢:问题提出者,同时也是本文校对者 @YH

总结:虽然可能这只是 SQL 查询中的一个小知识点,但不经分析,直接使用,可能会给业务带来不必要的坑,正所谓磨刀不误砍柴工,对常用技术的深入理解应该成为一个技术人的日常习惯。