Generate realistic test data
As data professionals, we often need test data, whether for functional testing, to exercise business logic, or for non-functional testing, to verify performance requirements. We should also avoid storing sensitive or personal information in non-production systems, as doing so could breach data protection regulations such as GDPR.
A common approach is to refresh test environments from production and thus load production data for testing. The problem with this approach is that production data may not exercise all of our business logic. For example, we could have a business rule that rewards customers born on February 29th, but we may not have any such customers. In that case, our production data would never trigger this particular business rule and we would never be able to validate it. The only way to test it is to generate test customers born on February 29th.
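To make the leap-day example concrete, here is a minimal shell sketch that lists February 29th birth dates for a range of years. The function names is_leap_year and leap_day_birthdays and the 1980 to 2000 range are my own illustrative assumptions; they are not part of the data generator.

```shell
# Succeeds (exit status 0) when the given year is a leap year
# in the Gregorian calendar.
is_leap_year() {
  ly_year=$1
  { [ $((ly_year % 4)) -eq 0 ] && [ $((ly_year % 100)) -ne 0 ]; } ||
    [ $((ly_year % 400)) -eq 0 ]
}

# Print one date (YYYY-02-29) for every leap year between
# the first and second argument, inclusive.
leap_day_birthdays() {
  year=$1
  last_year=$2
  while [ "$year" -le "$last_year" ]; do
    if is_leap_year "$year"; then
      printf '%04d-02-29\n' "$year"
    fi
    year=$((year + 1))
  done
}

# Hypothetical range, chosen only for illustration.
leap_day_birthdays 1980 2000
```

For the range above this prints six dates, from 1980-02-29 to 2000-02-29, which could be used as birth dates for test customers that trigger the rule.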
There are a number of online tools available to generate mock-up data. My favourite is https://www.generatedata.com by Benjamin Keen because it is open source, free and can be self-hosted. The online version only generates 100 records at a time, a limit that can be raised after a donation. The self-hosted version has no such limitation. Ben has done a fantastic job and I would urge you to donate on his website.
Prerequisites
The data generator is a PHP/MySQL application and therefore requires MySQL and PHP to be installed on the machine. This can be either Windows or Linux. I will be using Ubuntu Linux 18.04 for this demonstration. You can learn how to install an Ubuntu virtual machine in Azure in my previous post.
There is no need for a separate machine. You can install AMP (Apache, MySQL, PHP) locally on a Windows laptop. See https://ampps.com for details. I have chosen a dedicated VM as it makes it easier for me. I would, however, love to see Data Generator as a Docker container.
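The installation commands in this post need root privileges, so prefix them with sudo if you are not logged in as root. First and foremost, if you have just installed Ubuntu, you need to refresh the package lists: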
apt-get update
Install Apache, PHP and MySQL
Install Apache
apt-get install apache2
Install MySQL
apt-get install mysql-server
Install PHP
apt-get install php php-mysql libapache2-mod-php
Restart Apache
systemctl restart apache2
Now we should have a working web server with PHP and MySQL support. Let’s test it:
wget localhost
This should download the default page and save it as index.html, which means Apache is responding to requests and has served us the index.html page.
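Besides the HTTP check, it can be useful to confirm that all three components ended up on the PATH. The following is a small sketch under the assumption that the packages provide the apache2, mysql and php commands, as they do on Ubuntu; the helper name missing_commands is hypothetical.

```shell
# Print the name of every given command that cannot be found on the PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# apache2 lives in /usr/sbin, so run this as root or with sudo if your
# regular user's PATH does not include that directory.
missing=$(missing_commands apache2 mysql php)
if [ -z "$missing" ]; then
  echo "Apache, the MySQL client and PHP were all found"
else
  echo "Not found on the PATH:" $missing
fi
```

If anything is reported as missing, re-running the matching apt-get install command from above is the first thing to try.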
Configure MySQL
Connect to the MySQL server with the root user:
mysql -u root -p
Create a new database:
mysql> create database datagenerator;
Create a new user:
mysql> create user 'datagenerator'@'localhost' identified by 'SomeNewPassword';
Now, grant the new user privileges to the database:
mysql> grant all privileges on datagenerator.* to 'datagenerator'@'localhost';
Reload the privileges so that the changes take effect:
mysql> flush privileges;
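If you rebuild the environment often, the statements above can be generated from a small shell function instead of being typed by hand. This is only a sketch: the function name datagenerator_setup_sql is hypothetical, the IF NOT EXISTS clauses are my addition so the script can be re-run, and the values are inserted as-is, so it assumes names and passwords without quotes in them.

```shell
# Print the SQL used above for a given database name, user and password.
# Assumption: none of the three values contains a quote character,
# because they are placed into the statements without any escaping.
datagenerator_setup_sql() {
  db=$1
  user=$2
  password=$3
  cat <<EOF
CREATE DATABASE IF NOT EXISTS $db;
CREATE USER IF NOT EXISTS '$user'@'localhost' IDENTIFIED BY '$password';
GRANT ALL PRIVILEGES ON $db.* TO '$user'@'localhost';
FLUSH PRIVILEGES;
EOF
}

# Only prints the statements; nothing is sent to MySQL here.
datagenerator_setup_sql datagenerator datagenerator 'SomeNewPassword'
```

The output can be reviewed first and then piped into the client used earlier, for example by appending | mysql -u root -p to the last line.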
Install data generator
The installation guide is available on the project's GitHub page, but I will take you through it step by step.
Download the latest release (3.2.8 at the time of writing):
wget https://github.com/benkeen/generatedata/archive/3.2.8.zip
Unzip the downloaded package. First, we need to install unzip:
apt-get install unzip
Now unzip the package:
unzip 3.2.8.zip
By default, the Apache web server looks for websites in /var/www/html. This is defined by the DocumentRoot directive in the Apache configuration. To see the configuration, you can open it in a text editor. I use Vim:
vi /etc/apache2/sites-enabled/000-default.conf
Look for the DocumentRoot directive. To quit Vim, press Esc, then type :q and press Enter.
Now, we have to copy the extracted package to the DocumentRoot folder:
cp -a generatedata-3.2.8/ /var/www/html/
Grant access to the cache folder as per documentation:
chmod 777 /var/www/html/generatedata-3.2.8/cache/
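As a quick sanity check before opening the wizard, the sketch below confirms that the cache folder exists and carries the permissions set above. The helper name is_world_writable_dir is hypothetical, and the check simply reads the mode string printed by ls.

```shell
# Succeeds only if the path is a directory whose mode is exactly
# rwxrwxrwx, which is what chmod 777 sets.
is_world_writable_dir() {
  dir=$1
  [ -d "$dir" ] || return 1
  mode=$(ls -ld "$dir" | cut -c1-10)
  [ "$mode" = "drwxrwxrwx" ]
}

cache_dir=/var/www/html/generatedata-3.2.8/cache
if is_world_writable_dir "$cache_dir"; then
  echo "cache folder is ready"
else
  echo "cache folder is missing or not mode 777"
fi
```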
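Now, navigate to your server's IP address or DNS name in a browser, adding /generatedata-3.2.8/ to the address since that is the folder we copied into the DocumentRoot, and follow the wizard.
Configure the MySQL connection with the details we created earlier in this post: the datagenerator database, the datagenerator user and its password.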

On the next screens, you will configure the user account type and choose which plugins to install. After that, you can start using your own data generator without any limitations… well, the only limitation is the performance of your VM and how quickly it can generate data sets. In my example, on a VM with 2 CPUs (Standard D2s v3), generating SQL INSERT statements for 10,000 records is instant. I have made mine accessible via the Internet for test purposes, but you can keep yours within the local network. There is no need to expose it.

Result
An example of test customer data:

We can also generate an INSERT statement:

And voila!

Thanks for reading!
This post was originally published on March 30, 2020.