Home > SQL > SQL: SELECTing a random subset of data

SQL: SELECTing a random subset of data

Show me some data, Baby

Have a table with a lot of data in and want to work with a random subset of that data? The answer is the Sample Clause of the SELECT statement. The syntax is often described as something like the following:

SELECT *
FROM my_table SAMPLE (percentage_of_rows);

For example:

SQL> CREATE TABLE ANDY.objs AS SELECT * FROM all_objects;

Table created.

SQL> SELECT COUNT(*) FROM ANDY.objs;

COUNT(*)
———-
71277

SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);

COUNT(*)
———-
7107

"Some data" is not always the same "some data"

The above is demonstrably returning a random 10% of data from the ANDY.objs table. Great. Very useful. However, an (arguably) interesting thing happens if we run the exact same SELECT statement again:

SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);

COUNT(*)
———-
7020

And again:

SQL> SELECT COUNT(*) FROM ANDY.objs SAMPLE(10);

COUNT(*)
———-
7209

We get a different number of rows each time. It’s still close enough to 10% of the data, you might say. Maybe so but it does highlight an interesting point: The Sample Clause does not bring back x% of rows (as is often thought) but rather gives each row an x% chance of being returned.

From the Oracle Documentation:
"This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample. It does not mean that the database will retrieve exactly sample_percent of the rows of table."

The moral of this tale

Use the Sample Clause to work with a random subset of data but don’t rely on it to bring back exactly the expected number of rows. Complete details on the Sample Clause can be found here and more examples can be found here.

Advertisements
Categories: SQL
  1. February 6, 2009 at 6:20 pm

    “Some Data”, the same “Some Data”.. with SEED
    SQL> select ename, job
    2 from emp sample (10) seed (4)
    3 /

    ENAME JOB
    ———- ———
    FORD ANALYST

    SQL> /

    ENAME JOB
    ———- ———
    FORD ANALYST

  2. Andrew Tulley
    February 6, 2009 at 10:38 pm

    Good point, Alex. You can use SEED if you want the same random set of data repeatedly.

    SEED seed_value:- Specify this clause to instruct the database to attempt to return the same sample from one execution to the next. The seed_value must be an integer between 0 and 4294967295. If you omit this clause, then the resulting sample will change from one execution to the next.”

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: