Wednesday, 26 March 2014

Do the SQL Shuffle

A little puzzle arose when I was looking an anonymising client data. At one point I wondered if simply shuffling around some fairly innocuous text fields would be enough to separate some items from other parts of the data. The answer was no, but I still spent some time in the evening figuring out how I'd do it if I needed to - remind me to Get A Life at some point! :)

How to shuffle?


Suppose you have a simple table with an ID field, and a text description. You want to shuffle the 'des' data around so that they all still exist, but in different rows to where they started.

ID

DES
11

line 11
12

line 12
13

line 13
14

line 14
15

line 15
16

line 16
17

line 17
21

line 21
23

line 23

You can play with this using a temporary table (don't forget to DROP it later) :


-- create temp test data
CREATE TABLE #DataTable (
ID int,
Des varchar(10) );

WITH DATA AS (
   SELECT 11 AS ID, 'line 11' AS dez UNION ALL
   SELECT 12, 'line 12' UNION ALL
   SELECT 13, 'line 13' UNION ALL
   SELECT 14, 'line 14' UNION ALL
   SELECT 15, 'line 15' UNION ALL
   SELECT 16, 'line 16' UNION ALL
   SELECT 17, 'line 17' UNION ALL
   SELECT 21, 'line 21' UNION ALL
   SELECT 23, 'line 23' )
INSERT INTO #DataTable (ID, Des)
SELECT * FROM DATA


--  Insert Shuffle code here :)

SELECT * FROM #DataTable;
DROP TABLE #DataTable;


After finding that Rand() only executes once per SQL statement and is therefore useless for a random value per row..... eventually I found that  Abs(CheckSum(NewID())) %90 is a handy way to get the kind of random values I was after. Deciding to try a multi-stage CTE approach, I got this far :


With d1 AS
(
SELECT  -- shove a random value in
  ID,
  Des,
  Abs(CheckSum(NewID())) % 90 as Randm
FROM #DataTable
),
d2 as
(
SELECT  -- now get what we need with order sequences
  ID as ID,
  Row_number() OVER (ORDER BY ID) AS OrigRow,
  Row_number() OVER (ORDER BY Randm) as NewRow,
  Des as ChangedDes
FROM d1
)
SELECT * FROM d2 ORDER BY OrigRow



This gave me the following, and you can play with it at
http://sqlfiddle.com/#!3/fcba5/4/0

ID ORIGROW NEWROW CHANGEDDES
11 1 8 line 11
12 2 1 line 12
13 3 7 line 13
14 4 2 line 14
15 5 5 line 15
16 6 6 line 16
17 7 9 line 17
21 8 3 line 21
23 9 4 line 23

After some deep thought (very deep for that time of night) before retiring to my chamber, I came up with the idea of joining one row to another from the CTEs, from old order to new random order, and linking that to the original row being updated. :


-- Shuffle
With d1 AS
(
SELECT
  ID, Des, Abs(CheckSum(NewID())) % 90 as Randm
FROM #DataTable
),
d2 as
(
SELECT
  ID as ID,
  Row_number() OVER (ORDER BY ID) AS OrigRow,
  Row_number() OVER (ORDER BY Randm) as NewRow,
  Des as ChangedDes
FROM d1
)
UPDATE #DataTable
SET Des =
 (SELECT swap.ChangedDes
  FROM d2 swap
   INNER JOIN d2 old ON
     swap.NewRow = old.OrigRow
  WHERE old.ID = #DataTable.ID
 )
FROM #DataTable



That does the job nicely, and runs quickly enough too. See it in action at
http://sqlfiddle.com/#!3/fcba5/6/0

If you have any better ways to do it, I'm all ears. Comments welcomed, as always!

No comments:

Post a Comment