Monday, 24 March 2014

Anonymising Client Data

(I'm back working again after many long months unemployed, so I should be able to post little odds and ends here again)

If you want to demo your application using real data that you've brought back to the office from a client site, it's rather important to remove real names and addresses, etc., from the data before letting other parties set eyes on it. It won't inspire much confidence to show a potential client another client's data!

If you're using MS SQL Server then you may have the AdventureWorks database to hand, chock full of random names that can be used to overwrite the names and addresses that you wish to remove. Getting a random number to work in a simple UPDATE statement is a bit tricky because the RAND() is only evaluated once - so you'll get a different result each time but all the rows will be set the same :(

So I called upon my favourite CTE know-how and cooked up the following. It's a cascading multi-step CTE affair. First it retrieves the Max(ContactID)from the Contact table, so that the 0-to-1 random number can be scaled up to the full range of the table from which we'll be grabbing random names.

Next, we create a "ChangeData1" temporary table of IDs for the table we wish to change, which will also contain a random row ID from the Contacts table (a starting point to look one up) thanks to getting a random number from the checksum of a NewID(), modulus the maximum row number.

At this point you can then use the next stage of the CTEs to grab a real row number (using TOP 1 where the row ID > the random number) and JOIN to that row, and you'll have full access to the AdventureWorks row to change a number of fields in one UPDATE.

Or you can have this simpler example, where ChangeData2 becomes a list of IDs and one new field from the randomly selected AdventureWorks row.

USE [MyDatabase];

With AW_Names as  -- Get Maximum ID for range of random numbers
( SELECT Max(ContactID) as MaxID
  FROM [AdventureWorks].[Person].[Contact]
),
ChangeData1 as  -- Get ID numbers for table we want to change, and a random num for each row
(
 SELECT
   ID,
   ABS(CHECKSUM(NewId())) % (SELECT MaxID FROM AW_Names) as Rando
 FROM myTable
),
ChangeData2 as  -- Get ID numbers for table we want to change, and a name from AdventureWorks for each random field
(
 SELECT
   ID AS ChangeID,
   ( SELECT TOP 1 LastName FROM [AdventureWorks].[Person].[Contact]
     WHERE ContactID > Rando
   ) As NewField
 FROM ChangeData1
)
UPDATE myTable
SET
  Lastname = ChangeData2.NewField
FROM ChangeData2 WHERE ChangeData2.ChangeID=ID





It seems to run acceptably fast, even a more complex version that updates numerous fields at once, and randomises a Date of Birth column too. Thousands of records safely cleansed in the blink of an eye. This has become my starting point for a long script that will cleanse a number of tables.

Alternatively, you could take a look at Anonimatron, which was my first port of call before I decided to go it alone...

Anonimatron

This free software (a Java .jar) can whizz through tables and fields that you specify in an XML config file, using a variety of new data types (crazy ancient Roman names for example!) and can remember how it did it so that future runs will change the same fields to the same new values if you wish to keep certain parts of your data familiar.

See the author's blog for a guide : http://rolfje.wordpress.com/2013/11/03/anonimatron-quick-start/

Having updated my Java installation (so that .jar programs.can run), I then downloaded Anonimatron from http://sourceforge.net/projects/anonimatron/

I used the folder within the zip download to unzip into a folder :
C:\Program Files\Anonimatron\anonimatron-1.7\
I took a look at the batch file supplied, and converted it to Windows line feeds, and ran it. It complained about memory, not being able to "reserve enough space for object heap". Not a great intro to the joys of Java, but an opportunity to learn...

I eventually figured out what the -Xmx2G parameter was doing, so I tried -Xmx1600m instead, and got that much working. Keeping the original batch safe, I was now using a copy called ano.bat.

Realising that I wanted another batch to play with (to call the main one) I set up ano2.bat to call the first batch as required :
ano.bat -config config.xml -synonyms synonyms.xml

This was enough to do the trick, after I got it connecting to MS SQL, which called for the installation of a free 'JDBC' driver download and install.
Grab Microsoft JDBC Driver 4.0 for SQL Server  from http://www.microsoft.com/en-us/download/details.aspx?id=11774
and then place a copy of sqljdbc4.jar from that installation into
C:\Program Files\Anonimatron\anonimatron-1.7\jdbcdrivers
alongside the MySql and PostgreSQL ones.
If you wish to use integratedSecurity=true in the connection string, you'll need to copy sqljdbc_auth.dll to the main C:\Program Files\Anonimatron\anonimatron-1.7\  folder (or sort out the library path issues!)

Example

If this command line works :
sqlcmd -S myServer\test -d myDatabase -Q "SELECT ID,LastName FROM myTable"
then this XML file config.xml should work - in this case changing a name column to wacky Life of Brian names like Augusnus, Dibocus, Ebodabogubinus etc.!


<?xml version="1.0" encoding="UTF-8"?>
<configuration jdbcurl="jdbc:sqlserver://myServer;instanceName=test;DatabaseName=myDatabase;integratedSecurity=true;">

  <table name="myTable">
    <column name="LastName" type="ROMAN_NAME" />
  </table>

</configuration>


The command parameter -configexample is used to display help, running for me via the batch file:
ano.bat -configexample

This reveals the following possibilities for anonymising your data fields (and you can also write you own 'anonymizers' too, to bolt in)  :

DEFAULT_TYPES
java.lang.String
java.util.Date

CUSTOM_TYPES
RANDOMCHARACTERS
DUTCHBANKACCOUNT
UUID
ELVEN_NAME
ROMAN_NAME
PREFETCHCHARACTERS
RANDOMDIGITS
EMAIL_ADDRESS
STRING
BURGERSERVICENUMMER

The author is responsive to requests for assistance on his blog, or via the dev website, so it's worth considering Anonimatron if it seems to suit your needs. I'm going down the route of rolling my own, any other suggestions are welcomed if you'd like to comment. Thanks.

No comments:

Post a Comment