How to recover data from corrupt Elasticsearch indices in 3 simple steps

Analytics | Published March 21, 2017

Elasticsearch is a popular big data search engine based on Apache Lucene, and given its popularity it hardly needs an introduction. This blog will try to answer your questions about data recovery in case you have a corrupt index. Index corruption is not commonplace, but it can happen, so it is good to know your options and be prepared for the worst.

What are the conditions for recovering corrupt data?

You can recover and re-index data in a few simple steps, provided your files are in place and certain conditions hold. Here are the prerequisites for following the steps discussed below.

  • All the fields you want to recover need to be stored in the Apache Lucene files.
  • The cluster that holds the corrupt index should be running.
  • All nodes in the cluster should be up.

Once these three conditions are met, you can proceed to the three-step formula that will help you recover the valuable data from corrupt indices.

How to recover data?

Step 1: Identify the corrupt shard

Shards that belong to a corrupt index are typically reported in the UNASSIGNED state, and that is how you identify them. There are numerous ways to do this, so choose the one you are comfortable with; one option is a curl request against the cluster's REST API. Once you have identified the shard, locate its directory on disk using the cluster name.
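As a minimal sketch of this check (assuming the cluster is reachable on localhost:9200 with no authentication, and using Python's requests library instead of curl), you can ask the _cat/shards API for every shard and keep only those in the UNASSIGNED state:

```python
import requests

# List every shard with its state via the _cat/shards API.
# Assumes the cluster answers on localhost:9200 without authentication.
resp = requests.get(
    "http://localhost:9200/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,unassigned.reason"},
)
resp.raise_for_status()

# Keep only the shards the cluster could not assign.
for shard in resp.json():
    if shard["state"] == "UNASSIGNED":
        print(shard["index"], shard["shard"], shard["prirep"], shard.get("unassigned.reason"))
```

The same information is available from a plain curl call to the same endpoint if you prefer the command line.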

Step 2: Read data of corrupted shard

Elasticsearch stores its data in Apache Lucene indices, which it exposes through a RESTful interface, and four primary Lucene file types are involved in recovering data. A basic understanding of these is important before you start reading the data of the corrupt shard.

  • Fields: files with the .fnm extension store information about the fields in the index.
  • Field Data: files with the .fdt extension store the field data, i.e. the stored documents themselves.
  • Segment Info: files with the .si extension store information about each segment.
  • Field Index: files with the .fdx extension store pointers into the field data files.

When you read the data of the corrupt shard, you will need all four of these file types. Once you have identified the data you want to retrieve, you can create a new index for it.
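Before trying to read anything, it helps to confirm that those four file types are actually present in the shard's directory on disk. The sketch below simply lists the Lucene files by extension; the path is an assumption based on the default layout of older Elasticsearch releases (path.data/<cluster_name>/nodes/0/indices/<index>/<shard>/index) and will differ on your installation:

```python
from collections import defaultdict
from pathlib import Path

# Assumed default data path layout of older Elasticsearch releases;
# replace the cluster, index and shard names with your own.
shard_dir = Path("/var/lib/elasticsearch/my_cluster/nodes/0/indices/my_index/0/index")

# Group the Lucene files by extension so you can check that the fields (.fnm),
# field data (.fdt), field index (.fdx) and segment info (.si) files are present.
files_by_ext = defaultdict(list)
for f in shard_dir.iterdir():
    files_by_ext[f.suffix].append(f.name)

for ext in (".fnm", ".fdt", ".fdx", ".si"):
    print(ext, files_by_ext.get(ext, []))
```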

Step 3: Create a new index

Elasticsearch allows you to create a new index for the recovered data. The only thing to watch for is removing the _uid and _source fields before re-indexing. This is required because Elasticsearch stores these metadata fields in the Lucene files alongside your own fields, so they come back when you read the stored data, and unless you strip them out they will become part of your new index.
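A minimal sketch of that last step, again assuming a cluster on localhost:9200 with no authentication: recovered_docs stands in for whatever documents you pulled out of the corrupt shard in step 2, recovered_index is a hypothetical name for the new index, and the metadata fields are stripped before the documents are sent in a single _bulk request (older Elasticsearch releases also expect a _type in each bulk action line):

```python
import json
import requests

ES = "http://localhost:9200"      # assumed cluster address, no authentication
NEW_INDEX = "recovered_index"     # hypothetical name for the new index

# Create the new index with default settings; add mappings here if needed.
requests.put(f"{ES}/{NEW_INDEX}").raise_for_status()

# Stand-in for the documents recovered from the corrupt shard in step 2.
recovered_docs = [
    {"_uid": "post#1", "_source": "...", "title": "example", "views": 42},
]

# Strip the _uid and _source metadata fields, then re-index everything
# in one _bulk request (newline-delimited JSON).
bulk_lines = []
for doc in recovered_docs:
    body = {k: v for k, v in doc.items() if k not in ("_uid", "_source")}
    bulk_lines.append(json.dumps({"index": {"_index": NEW_INDEX}}))
    bulk_lines.append(json.dumps(body))

resp = requests.post(
    f"{ES}/_bulk",
    data="\n".join(bulk_lines) + "\n",
    headers={"Content-Type": "application/x-ndjson"},
)
resp.raise_for_status()
print("bulk errors:", resp.json()["errors"])
```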

How to prevent data loss?

Accidents do happen, and the steps above equip you to handle a big data accident. But you do not have to wait for one to occur: you can take preventive measures that, if they cannot stop the accident, can at least minimize its impact.
The strategy in this case is data replication. As the name suggests, it means keeping a backup, or replica, of the data. The administrator can adjust the replication factor according to the number of replicas required. Data replication is an effective preventive measure against data loss, but it does have its downsides.
Big data is, by definition, large in volume, and storage is already a concern with big data sets. In such cases it is sometimes simply impractical to keep a replica of the data, and this is the biggest reason why big data developers skip backups. But if you do not have budget constraints, it is well worth having a data replication scheme in place to avoid any mishaps.
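For completeness, here is a minimal sketch of adjusting the replication factor on an existing index (my_index is a hypothetical name, and the cluster is again assumed to be on localhost:9200 with no authentication):

```python
import requests

ES = "http://localhost:9200"   # assumed cluster address, no authentication
INDEX = "my_index"             # hypothetical index name

# Raise the number of replica copies kept for each primary shard;
# Elasticsearch allocates the extra copies to other nodes in the cluster.
resp = requests.put(
    f"{ES}/{INDEX}/_settings",
    json={"index": {"number_of_replicas": 2}},
)
resp.raise_for_status()
print(resp.json())
```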

Why do I need to have a replica?

Every technology has built-in security features, and Elasticsearch is no exception. Still, there have been recent reports of Elasticsearch clusters being hit by ransomware. The creators have maintained that the problem lies not in the software but in how it is configured. Wherever the fault lies, if your valuable data is lost, it is gone forever. This prompts big data architects and developers to take backups and maintain robust security. It is advisable to follow the recommended security settings.
Big data serves an important purpose in many companies, so any glitch in the technology you use to maintain it can hinder a business process. It is important to stay aware of the challenges that might crop up and to keep your technology up to date. This leaves you better prepared to resolve issues when they appear, and it also picks up the fixes the service provider ships in its recent updates.