快速从一个表里随机获取n行数据的方法-凯发app官方网站

凯发app官方网站-凯发k8官网下载客户端中心 | | 凯发app官方网站-凯发k8官网下载客户端中心
  • 博客访问: 1156148
  • 博文数量: 166
  • 博客积分: 0
  • 博客等级: 民兵
  • 技术积分: 3760
  • 用 户 组: 普通用户
  • 注册时间: 2011-03-30 13:00
个人简介

about me:oracle ace pro,optimistic,passionate and harmonious. focus on oracle,mysql and other database programming,peformance tuning,db design, j2ee,linux/aix,architecture tech,etc

文章分类

全部博文(166)

文章存档

2024年(21)

2023年(28)

2022年(43)

2020年(62)

2014年(3)

2013年(9)

相关博文
  • ·
  • ·
  • ·
  • ·
  • ·
  • ·
  • ·
  • ·
  • ·
  • ·

分类: oracle

2024-05-06 15:29:59

  本文主要讲解如何从一个表里快速随机获取前n行的方法,用较少io,利用索引的快速扫描,快速从表里获取随机的前n行数据。 
 
建表语句如下:

点击(此处)折叠或打开

  1. 建表语句如下:
  2.  drop table t;
  3.  create table t(id number,name varchar2(1000),ext varchar2(1000));
  4.  insert into t
  5.  select level,rpad('t',500,'x')||level,rpad('t',500,'x')||level
  6.  from dual
  7.  connect by level<=300000;
  8.  commit;
  9.  create index idx_t on t(id);
  10.  
  11.  exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'t',no_invalidate=>false);
  12. t表总共有30w行:
     select count(*) from t;


      count(*)
    ----------
        300000
         

1.原始方法,全表扫描 排序,随机取10条记录

全表扫描,因为是随机的,所以要读完所有30w行再随机排序取10行,效率非常低
sql如下:

点击(此处)折叠或打开

  1. select * from (
  2. select * from t order by dbms_random.value
  3. ) where rownum <= 10;
执行计划如下:

点击(此处)折叠或打开

  1. plan hash value: 3299198703

  2. ------------------------------------------------------------------------------------------------------------------------------
  3. | id | operation | name | starts | e-rows | a-rows | a-time | buffers | reads | omem | 1mem | used-mem |
  4. ------------------------------------------------------------------------------------------------------------------------------
  5. | 0 | select statement | | 1 | | 10 |00:00:00.81 | 43171 | 22014 | | | |
  6. |* 1 | count stopkey | | 1 | | 10 |00:00:00.81 | 43171 | 22014 | | | |
  7. | 2 | view | | 1 | 300k| 10 |00:00:00.81 | 43171 | 22014 | | | |
  8. |* 3 | sort order by stopkey| | 1 | 300k| 10 |00:00:00.81 | 43171 | 22014 | 106k| 106k|96256 (0)|
  9. | 4 | table access full | t | 1 | 300k| 300k|00:00:00.18 | 43171 | 22014 | | | |
  10. ------------------------------------------------------------------------------------------------------------------------------

  11. predicate information (identified by operation id):
  12. ---------------------------------------------------

  13.    1 - filter(rownum<=10)
  14.    3 - filter(rownum<=10)
通过执行计划可以看到id=4,读了表t的全部数据300k行,然后排序取10行。耗时0.81s,逻辑读43171,从执行计划看,主要耗时在排序上,
因为是对30w行随机排序,所以效率很低,后续优化要考虑避免扫描全量数据以及避免全量数据的随机排序。

2.改进方法,利用覆盖索引提高效率

因为id有索引,索引比表小,可以采用覆盖索引,先从索引里随机获取前10行,然后再到表里查。

可以从下面看到idx_t索引只有6m,比表345m少很多:

点击(此处)折叠或打开

  1. owner segment_name segment_type size_mb iniext maxext
  2. ---------- ------------------------------ ------------------ ---------- ---------- ----------
  3. dingjun123 t table 345 65536 1048576
  4. dingjun123 idx_t index 6 65536 1048576
为了能够走覆盖索引,将id改为not null:

点击(此处)折叠或打开

  1. alter table t modify id not null;
sql如下:

点击(此处)折叠或打开

  1. select * from t where id in (
  2. select id
  3. from(
  4. select id from t order by dbms_random.value
  5. ) where rownum <= 10
  6. );
执行计划如下:

点击(此处)折叠或打开

  1. plan hash value: 753993260

  2. ---------------------------------------------------------------------------------------------------------------------------------------
  3. | id | operation | name | starts | e-rows | a-rows | a-time | buffers | reads | omem | 1mem | used-mem |
  4. ---------------------------------------------------------------------------------------------------------------------------------------
  5. | 0 | select statement | | 1 | | 10 |00:00:00.41 | 699 | 5 | | | |
  6. | 1 | nested loops | | 1 | | 10 |00:00:00.41 | 699 | 5 | | | |
  7. | 2 | nested loops | | 1 | 10 | 10 |00:00:00.41 | 689 | 0 | | | |
  8. | 3 | view | vw_nso_1 | 1 | 10 | 10 |00:00:00.41 | 676 | 0 | | | |
  9. | 4 | hash unique | | 1 | 10 | 10 |00:00:00.41 | 676 | 0 | 1397k| 1397k| 1259k (0)|
  10. |* 5 | count stopkey | | 1 | | 10 |00:00:00.41 | 676 | 0 | | | |
  11. | 6 | view | | 1 | 300k| 10 |00:00:00.41 | 676 | 0 | | | |
  12. |* 7 | sort order by stopkey | | 1 | 300k| 10 |00:00:00.41 | 676 | 0 | 2048 | 2048 | 2048 (0)|
  13. | 8 | index fast full scan | idx_t | 1 | 300k| 300k|00:00:00.03 | 676 | 0 | | | |
  14. |* 9 | index range scan | idx_t | 10 | 1 | 10 |00:00:00.01 | 13 | 0 | | | |
  15. | 10 | table access by index rowid| t | 10 | 1 | 10 |00:00:00.01 | 10 | 5 | | | |
  16. ---------------------------------------------------------------------------------------------------------------------------------------

  17. predicate information (identified by operation id):
  18. ---------------------------------------------------

  19.    5 - filter(rownum<=10)
  20.    7 - filter(rownum<=10)
  21.    9 - access("id"="id")
通过覆盖索引,0.41s,逻辑读从4.3w降低到676,主要耗时还是在随机排序上(占95%以上)。

3.使用索引min/max快速扫描构造随机数提高效率

优化主要集中在下面两点:
1)怎么避免扫描全量数据
2)怎么避免对全量数据随机排序
避免扫描全量数据,必须构造一个id值,从此id处开始扫描,然后用id排序,则可以消除以上两点。方法如下:

索引的min,max查询效率很高(oracle不能写在同一个select里否则不能用快速扫描),只需要扫描左右侧索引no.1叶子节点,可以实现快速扫描,
然后利用floor(tmp.min_id (tmp.max_id - tmp.min_id 1) * dbms_random.value) 构造随机id,然后与原表关联,利用索引列id排序,
因为是索引列,可以利用索引消除排序,然后取100条,为了防止断号(因为取10条的话,这个id是连续的,也不够连续,如果id本身不是连续的,
也可能取不到10条),先取100条,然后从100条里随机取10条,比全扫描效率高很多

sql如下:

点击(此处)折叠或打开

  1. select *
  2.   from t
  3.  where id in
  4.        (select id
  5.           from (select id
  6.                   from (select id
  7.                           from (select id
  8.                                   from t
  9.                                  where t.id >=
  10.                                        (select floor(tmp.min_id
  11.                                                      (tmp.max_id - tmp.min_id 1) *
  12.                                                      dbms_random.value) id
  13.                                           from (select (select max(id) from t) max_id,
  14.                                                        (select min(id) from t) min_id
  15.                                                   from dual) tmp)
  16.                                  order by t.id)
  17.                          where rownum <= 100)
  18.                  order by dbms_random.value)
  19.          where rownum <= 10);
执行计划如下:

点击(此处)折叠或打开

  1. plan hash value: 72274151

  2. -----------------------------------------------------------------------------------------------------------------------------------------
  3. | id | operation | name | starts | e-rows | a-rows | a-time | buffers | omem | 1mem | used-mem |
  4. -----------------------------------------------------------------------------------------------------------------------------------------
  5. | 0 | select statement | | 1 | | 10 |00:00:00.01 | 30 | | | |
  6. | 1 | nested loops | | 1 | | 10 |00:00:00.01 | 30 | | | |
  7. | 2 | nested loops | | 1 | 10 | 10 |00:00:00.01 | 20 | | | |
  8. | 3 | view | vw_nso_1 | 1 | 10 | 10 |00:00:00.01 | 7 | | | |
  9. | 4 | hash unique | | 1 | 10 | 10 |00:00:00.01 | 7 | 1397k| 1397k| 847k (0)|
  10. |* 5 | count stopkey | | 1 | | 10 |00:00:00.01 | 7 | | | |
  11. | 6 | view | | 1 | 100 | 10 |00:00:00.01 | 7 | | | |
  12. |* 7 | sort order by stopkey | | 1 | 100 | 10 |00:00:00.01 | 7 | 2048 | 2048 | 2048 (0)|
  13. | 8 | view | | 1 | 100 | 100 |00:00:00.01 | 7 | | | |
  14. |* 9 | count stopkey | | 1 | | 100 |00:00:00.01 | 7 | | | |
  15. | 10 | view | | 1 | 100 | 100 |00:00:00.01 | 7 | | | |
  16. |* 11 | index range scan | idx_t | 1 | 100 | 100 |00:00:00.01 | 7 | | | |
  17. | 12 | fast dual | | 1 | 1 | 1 |00:00:00.01 | 0 | | | |
  18. | 13 | sort aggregate | | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  19. | 14 | index full scan (min/max) | idx_t | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  20. | 15 | sort aggregate | | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  21. | 16 | index full scan (min/max)| idx_t | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  22. |* 17 | index range scan | idx_t | 10 | 1 | 10 |00:00:00.01 | 13 | | | |
  23. | 18 | table access by index rowid | t | 10 | 1 | 10 |00:00:00.01 | 10 | | | |
  24. -----------------------------------------------------------------------------------------------------------------------------------------

  25. predicate information (identified by operation id):
  26. ---------------------------------------------------

  27.    5 - filter(rownum<=10)
  28.    7 - filter(rownum<=10)
  29.    9 - filter(rownum<=100)
  30.   11 - access("t"."id">=)
  31.   17 - access("id"="id")
从执行计划看,改进的sql逻辑读只需要30,执行时间0.01s。

4)使用索引min/max 递归with构造前n个随机数
和3类似,只不过这里使用递归with,先构造一个随机id,然后递归,取100条,然后随机排序后取10条。
如下所示:
sql如下:

点击(此处)折叠或打开

  1. with tmp(rn,id) as
  2. (
  3. select 1 rn,floor(tmp.min_id (tmp.max_id - tmp.min_id 1) * dbms_random.value) id
  4.         from (select (select max(id) from t) max_id,
  5.                       (select min(id) from t) min_id
  6.                      from dual) tmp
  7. union all
  8. select rn1,id
  9. from tmp
  10. where rn <100
  11. )
  12. select * from t where id in (
  13. select id from (
  14. select idrn as id from tmp
  15. order by dbms_random.value
  16. ) where rownum <= 10
  17. );
执行计划如下:

点击(此处)折叠或打开

  1. plan hash value: 2701499939

  2. --------------------------------------------------------------------------------------------------------------------------------------------------
  3. | id | operation | name | starts | e-rows | a-rows | a-time | buffers | omem | 1mem | used-mem |
  4. --------------------------------------------------------------------------------------------------------------------------------------------------
  5. | 0 | select statement | | 1 | | 10 |00:00:00.01 | 22 | | | |
  6. | 1 | nested loops | | 1 | | 10 |00:00:00.01 | 22 | | | |
  7. | 2 | nested loops | | 1 | 1 | 10 |00:00:00.01 | 13 | | | |
  8. | 3 | view | vw_nso_1 | 1 | 2 | 10 |00:00:00.01 | 4 | | | |
  9. | 4 | hash unique | | 1 | 1 | 10 |00:00:00.01 | 4 | 1397k| 1397k| 1092k (0)|
  10. |* 5 | count stopkey | | 1 | | 10 |00:00:00.01 | 4 | | | |
  11. | 6 | view | | 1 | 2 | 10 |00:00:00.01 | 4 | | | |
  12. |* 7 | sort order by stopkey | | 1 | 2 | 10 |00:00:00.01 | 4 | 2048 | 2048 | 2048 (0)|
  13. | 8 | view | | 1 | 2 | 100 |00:00:00.01 | 4 | | | |
  14. | 9 | union all (recursive with) breadth first| | 1 | | 100 |00:00:00.01 | 4 | | | |
  15. | 10 | sort aggregate | | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  16. | 11 | index full scan (min/max) | idx_t | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  17. | 12 | sort aggregate | | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  18. | 13 | index full scan (min/max) | idx_t | 1 | 1 | 1 |00:00:00.01 | 2 | | | |
  19. | 14 | fast dual | | 1 | 1 | 1 |00:00:00.01 | 0 | | | |
  20. | 15 | recursive with pump | | 100 | | 99 |00:00:00.01 | 0 | | | |
  21. |* 16 | index range scan | idx_t | 10 | 1 | 10 |00:00:00.01 | 9 | | | |
  22. | 17 | table access by index rowid | t | 10 | 1 | 10 |00:00:00.01 | 9 | | | |
  23. --------------------------------------------------------------------------------------------------------------------------------------------------

  24. predicate information (identified by operation id):
  25. ---------------------------------------------------

  26.    5 - filter(rownum<=10)
  27.    7 - filter(rownum<=10)
  28.   16 - access("id"="id")
逻辑读22,执行时间0.01s。

总结:
从一个表里随机取10条记录的关键在于:
1)避免扫描全部数据
2)避免对全部数据随机排序
所以关键在于构造一个随机id,然后用这个id为起始点去表里找,可以走索引查找,然后可以随机
取100条(为了避免id不连续的断号问题),减少扫描和排序量,充分利用索引的高效扫描
提高效率。
构造随机id,利用扫描索引取min,max可以快速扫描,然后利用min_id为起点 随机数,
这个随机数用(tmp.max_id - tmp.min_id 1) * dbms_random.value
构造,然后用floor取整即可。




阅读(3243) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~
")); function link(t){ var href= $(t).attr('href'); href ="?url=" encodeuricomponent(location.href); $(t).attr('href',href); //setcookie("returnouturl", location.href, 60, "/"); }
网站地图