(I'm back working again after many long months unemployed, so I should be able to post little odds and ends here again)
If you want to demo your application using real data that you've brought back to the office from a client site, it's rather important to remove real names and addresses, etc., from the data before letting other parties set eyes on it. It won't inspire much confidence to show a potential client another client's data!
If you're using MS SQL Server then you may have the AdventureWorks database to hand, chock full of random names that can be used to overwrite the names and addresses that you wish to remove. Getting a random number to work in a simple UPDATE statement is a bit tricky because the RAND() is only evaluated once - so you'll get a different result each time but all the rows will be set the same :(
So I called upon my favourite CTE know-how and cooked up the following. It's a cascading multi-step CTE affair. First it retrieves the
Max(ContactID)from the Contact table, so that the 0-to-1 random number can be scaled up to the full range of the table from which we'll be grabbing random names.
Next, we create a "ChangeData1" temporary table of IDs for the table we wish to change, which will also contain a random row ID from the Contacts table
(a starting point to look one up) thanks to getting a random number from the checksum of a NewID(), modulus the maximum row number.
At this point you can then use the next stage of the CTEs to grab a real row number
(using TOP 1 where the row ID > the random number) and JOIN to that row, and you'll have full access to the AdventureWorks row to change a number of fields in one UPDATE.
Or you can have this simpler example, where ChangeData2 becomes a list of IDs and one new field from the randomly selected AdventureWorks row.
USE [MyDatabase];
With AW_Names as -- Get Maximum ID for range of random numbers
( SELECT Max(ContactID) as MaxID
FROM [AdventureWorks].[Person].[Contact]
),
ChangeData1 as -- Get ID numbers for table we want to change, and a random num for each row
(
SELECT
ID,
ABS(CHECKSUM(NewId())) % (SELECT MaxID FROM AW_Names) as Rando
FROM myTable
),
ChangeData2 as -- Get ID numbers for table we want to change, and a name from AdventureWorks for each random field
(
SELECT
ID AS ChangeID,
( SELECT TOP 1 LastName FROM [AdventureWorks].[Person].[Contact]
WHERE ContactID > Rando
) As NewField
FROM ChangeData1
)
UPDATE myTable
SET
Lastname = ChangeData2.NewField
FROM ChangeData2 WHERE ChangeData2.ChangeID=ID
It seems to run acceptably fast, even a more complex version that updates numerous fields at once, and randomises a Date of Birth column too. Thousands of records safely cleansed in the blink of an eye. This has become my starting point for a long script that will cleanse a number of tables.
Alternatively, you could take a look at Anonimatron, which was my first port of call before I decided to go it alone...
Anonimatron
This free software (a Java .jar) can whizz through tables and fields that you specify in an XML config file, using a variety of new data types (crazy ancient Roman names for example!) and can remember how it did it so that future runs will change the same fields to the same new values if you wish to keep certain parts of your data familiar.
See the author's blog for a guide :
http://rolfje.wordpress.com/2013/11/03/anonimatron-quick-start/
Having updated my Java installation (so that .jar programs.can run), I then downloaded Anonimatron from
http://sourceforge.net/projects/anonimatron/
I used the folder within the zip download to unzip into a folder :
C:\Program Files\Anonimatron\anonimatron-1.7\
I took a look at the batch file supplied, and converted it to Windows line feeds, and ran it. It complained about memory, not being able to "reserve enough space for object heap". Not a great intro to the joys of Java, but an opportunity to learn...
I eventually figured out what the
-Xmx2G parameter was doing, so I tried
-Xmx1600m instead, and got that much working. Keeping the original batch safe, I was now using a copy called
ano.bat.
Realising that I wanted another batch to play with (to call the main one) I set up
ano2.bat to call the first batch as required :
ano.bat -config config.xml -synonyms synonyms.xml
This was enough to do the trick, after I got it connecting to MS SQL, which called for the installation of a free 'JDBC' driver download and install.
Grab Microsoft JDBC Driver 4.0 for SQL Server from
http://www.microsoft.com/en-us/download/details.aspx?id=11774
and then place a copy of
sqljdbc4.jar from that installation into
C:\Program
Files\Anonimatron\anonimatron-1.7\jdbcdrivers
alongside the MySql and
PostgreSQL ones.
If you wish to use
integratedSecurity=true in the connection string, you'll need to copy
sqljdbc_auth.dll to the main
C:\Program
Files\Anonimatron\anonimatron-1.7\ folder
(or sort out the library path
issues!)
Example
If this command line works :
sqlcmd -S myServer\test -d myDatabase -Q "SELECT ID,LastName FROM myTable"
then this XML file config.xml should work - in this case changing a name column to wacky Life of Brian names like Augusnus, Dibocus, Ebodabogubinus etc.!
<?xml version="1.0" encoding="UTF-8"?>
<configuration jdbcurl="jdbc:sqlserver://myServer;instanceName=test;DatabaseName=myDatabase;integratedSecurity=true;">
<table name="myTable">
<column name="LastName" type="ROMAN_NAME" />
</table>
</configuration>
The command parameter -configexample is used to display help, running for me via the batch file:
ano.bat -configexample
This reveals the following possibilities for anonymising your data fields (and you can also write you own 'anonymizers' too, to bolt in) :
DEFAULT_TYPES
java.lang.String
java.util.Date
CUSTOM_TYPES
RANDOMCHARACTERS
DUTCHBANKACCOUNT
UUID
ELVEN_NAME
ROMAN_NAME
PREFETCHCHARACTERS
RANDOMDIGITS
EMAIL_ADDRESS
STRING
BURGERSERVICENUMMER
The author is responsive to requests for assistance on his blog, or via the dev website, so it's worth considering Anonimatron if it seems to suit your needs. I'm going down the route of rolling my own, any other suggestions are welcomed if you'd like to comment. Thanks.