Andy’s Blog: Application Express etc.


SQL: SELECTing a random subset of data

Posted in SQL by Andrew Tulley on the February 6, 2009

Show me some data, Baby

Have a table with a lot of data in and want to work with a random subset of that data? The answer is the Sample Clause of the SELECT statement. The syntax is often described as something like the following:

SELECT *
FROM my_table SAMPLE (percentage_of_rows);

For example:

SQL> CREATE TABLE ANDY.objs AS SELECT * FROM all_objects;

Table created.

SQL> SELECT COUNT(*) FROM ANDY.objs;

COUNT(*)
———-
71277

SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);

COUNT(*)
———-
7107

"Some data" is not always the same "some data"

The above is demonstrably returning a random 10% of data from the ANDY.objs table. Great. Very useful. However, an (arguably) interesting thing happens if we run the exact same SELECT statement again:

SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);

COUNT(*)
———-
7020

And again:

SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);

COUNT(*)
———-
7209

We get a different number of rows each time. It’s still close enough to 10% of the data, you might say. Maybe so but it does highlight an interesting point: The Sample Clause does not bring back x% of rows (as is often thought) but rather gives each row an x% chance of being returned.

From the Oracle Documentation:
"This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample. It does not mean that the database will retrieve exactly sample_percent of the rows of table."

The moral of this tale

Use the Sample Clause to work with a random subset of data but don’t rely on it to bring back exactly the expected number of rows. Complete details on the Sample Clause can be found here and more examples can be found here.

2 Responses to 'SQL: SELECTing a random subset of data'

Subscribe to comments with RSS or TrackBack to 'SQL: SELECTing a random subset of data'.

  1. Alex Nuijten said,

    “Some Data”, the same “Some Data”.. with SEED
    SQL> select ename, job
    2 from emp sample (10) seed (4)
    3 /

    ENAME JOB
    ———- ———
    FORD ANALYST

    SQL> /

    ENAME JOB
    ———- ———
    FORD ANALYST

  2. Andrew Tulley said,

    Good point, Alex. You can use SEED if you want the same random set of data repeatedly.

    SEED seed_value:- Specify this clause to instruct the database to attempt to return the same sample from one execution to the next. The seed_value must be an integer between 0 and 4294967295. If you omit this clause, then the resulting sample will change from one execution to the next.”


Leave a Reply