SQL: SELECTing a random subset of data
Show me some data, Baby
Have a table with a lot of data in and want to work with a random subset of that data? The answer is the Sample Clause of the SELECT statement. The syntax is often described as something like the following:
SELECT *
FROM my_table SAMPLE (percentage_of_rows);
For example:
SQL> CREATE TABLE ANDY.objs AS SELECT * FROM all_objects;
Table created.
SQL> SELECT COUNT(*) FROM ANDY.objs;
COUNT(*)
———-
71277
SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);
COUNT(*)
———-
7107
"Some data" is not always the same "some data"
The above is demonstrably returning a random 10% of data from the ANDY.objs table. Great. Very useful. However, an (arguably) interesting thing happens if we run the exact same SELECT statement again:
SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);
COUNT(*)
———-
7020
And again:
SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);
COUNT(*)
———-
7209
We get a different number of rows each time. It’s still close enough to 10% of the data, you might say. Maybe so but it does highlight an interesting point: The Sample Clause does not bring back x% of rows (as is often thought) but rather gives each row an x% chance of being returned.
From the Oracle Documentation:
"This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample. It does not mean that the database will retrieve exactly sample_percent of the rows of table."
The moral of this tale
Use the Sample Clause to work with a random subset of data but don’t rely on it to bring back exactly the expected number of rows. Complete details on the Sample Clause can be found here and more examples can be found here.
on February 6, 2009 on 6:20 pm
“Some Data”, the same “Some Data”.. with SEED
SQL> select ename, job
2 from emp sample (10) seed (4)
3 /
ENAME JOB
———- ———
FORD ANALYST
SQL> /
ENAME JOB
———- ———
FORD ANALYST
on February 6, 2009 on 10:38 pm
Good point, Alex. You can use SEED if you want the same random set of data repeatedly.
“SEED seed_value:- Specify this clause to instruct the database to attempt to return the same sample from one execution to the next. The seed_value must be an integer between 0 and 4294967295. If you omit this clause, then the resulting sample will change from one execution to the next.”