Being able to experiment with big data and queries in a safe and secure “sandbox” test environment is important to both IT and end business users as companies get going with big data. Nevertheless, setting up a big data sandbox test environment is different from establishing traditional test environments for transactional data and reports. Here are ten key strategies to keep in mind for building and managing big data sandboxes:
1. Data mart or master data repository?
The data base administrator needs to make a decision early-on as to whether to have test sandboxes use data directly from the master data repository that production uses, or whether the best solution is to replicate and splinter off sections of this data into separate data marts that are reserved for testing purposes only. The advantage of the full data repository is that testing actually uses data that is used in production, so test results will be more accurate. The disadvantage is that data contention can be created with production itself. With the data mart strategy, you don’t risk contention with production data—but the data will likely need to be periodically refreshed to stay in some degree of synchronization with data being used in production if it is going to closely approximate the production environment.
2. Work out scheduling
Scheduling is one of the most important big data sandbox activities. It ensures that all sandbox work is optimally being run. It usually achieves this by concurrently scheduling a group of smaller jobs that can be completed while a longer job is being run. In this way, resources are allocated to as many jobs as possible. The key to this process is for IT to sit down with the various user areas that are using sandboxes so everyone has an upfront understanding of the schedule, the rationale behind it, and when they can expect their jobs to run.
3. Set limits
If months go by without a specific data mart or sandbox being used, business users and IT should have mutually acceptable policies in place for purging these resources so they can be put back into a resource pool that can be re-provisioned for other activities. The test environment should be managed as effectively as its production environment counterpart so that resources are called into play only when they are actively being used.