Yesterday I ran into an HBase startup exception in a customer's test environment. The symptom was that the CDH interface could not show which HMaster was active and which was standby, and the Master stayed stuck in the initializing state. Running hbase hbck returned the error: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing …
Since it is an HMaster exception, first go to the /var/log/hbase directory to view the Master’s logs, and the error is as follows:
The error message Reported time is too far out of sync with master indicates a clock-synchronization problem, so I checked clock synchronization on each node and found it is currently normal. My suspicion is that this environment had a serious clock problem at some point in the past, e.g. the system clock or time zone was suddenly changed, causing a large time jump on the nodes.
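For reference, a quick way to run this check is to compare the local time and NTP offset on every node, for example (a minimal sketch; node1/node2/node3 are placeholder hostnames, and chronyc tracking is the equivalent check if chrony is used instead of ntpd):
for h in node1 node2 node3; do
  ssh "$h" 'hostname; date; ntpq -p'
done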
But anyway, let’s see how to solve this problem first~
So we continued to look at the log (as above) and found that it was full of WARN messages: Found a log (hdfs://nameservice1/hbase/oldWALs/xxx) newer than current time, probably a clock skew. Searching all of the Master log files for the keyword "Found a log" showed that these warnings all pointed to files under oldWALs.
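The search itself is just a grep, for example (the log file name pattern is an assumption based on a typical CDH layout):
grep "Found a log" /var/log/hbase/*MASTER*.log*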
So I checked the oldWALs path under /hbase and found that the timestamps of this directory really were off by a lot. As you can see from the screenshot below, the modification time of /hbase/oldWALs is actually 2025-07-09, and the same goes for the /hbase/MasterProcWALs and /hbase/WALs directories. This shows that the clock in this environment really did change dramatically at some point.
hadoop fs -ls /hbase
Since oldWALs files can safely be deleted, I decided at the time to delete everything under oldWALs and try restarting HBase. (I didn't look closely at the timestamps of each directory back then; following the log hints I only assumed the problem was with the oldWALs directory, so it didn't occur to me to also delete the contents of the /hbase/MasterProcWALs and /hbase/WALs directories.)
hadoop fs -rmr hdfs://nameservice1/hbase/oldWALs
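In hindsight, a safer variant would have been to move the files aside instead of deleting them outright, something like the following (/user/hbase/oldWALs_bak is just an assumed backup location):
hadoop fs -mkdir -p /user/hbase/oldWALs_bak
hadoop fs -mv hdfs://nameservice1/hbase/oldWALs/* /user/hbase/oldWALs_bak/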
After deleting the oldWALs and restarting HBase, HBase still failed to start. Looking at the HBase log, we found the following error: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from hdfs://nameservice1/hbase/meta/xxx/info/xxx
Based on the above error message, I suspected that a data file of the HBase meta table was corrupted, which we also verified separately using the following command:
hdfs fsck /hbase/data/hbase/meta/1588230740/info/e029xxx
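Besides fsck, the suspect HFile can also be inspected with HBase's built-in store file analyzer, which prints the trailer and metadata and will likewise complain if the trailer is corrupt (a sketch only, using the tool's standard -f/-m/-v options):
hbase hfile -v -m -f hdfs://nameservice1/hbase/data/hbase/meta/1588230740/info/e029xxx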
Searching for this error turned up a similar case, see https://forum.huawei.com/enterprise/zh/thread-870917.html. The approach described there is roughly: move the meta table's files aside as a backup, then regenerate the meta data with hbase hbck -repair. The key steps are as follows:
1. Stop HBase
2. Back up the meta directory
hdfs dfs -mv /user/hbase /user/hbase_bak
hdfs dfs -mkdir /user/hbase
hdfs dfs -mv /hbase/data/hbase/namespace /user/hbase/namespace
hdfs dfs -mv /hbase/data/hbase/meta /user/hbase/meta
3. Start HBase
4. Repair the meta data
hbase hbck -repair
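After the repair finishes, the rebuilt meta can be sanity-checked from the HBase shell, for example:
echo "scan 'hbase:meta', {LIMIT => 10}" | sudo -u hbase hbase shell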
After completing the above steps, HBase could finally start successfully, but some tables still had problems. For example, there was a TRAFODION metadata table that showed up in list but could not be scanned.
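The symptom can be reproduced from the shell like this (the full table name is taken from the hbck output further below):
echo "list" | sudo -u hbase hbase shell
echo "scan 'TRAF_RSRVD_1:TRAFODION._MD_.TEXT', {LIMIT => 1}" | sudo -u hbase hbase shell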
This indicates that the metadata of some HBase tables was still inconsistent, so I ran a full check with hbck.
sudo -u hbase hbase hbck
The command above reported 25 inconsistent objects; TRAFODION._MD_.TEXT is one of them.
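The full hbck report is quite long; filtering for the ERROR lines and the summary makes it easier to read, e.g.:
sudo -u hbase hbase hbck 2>&1 | grep -E "ERROR|Summary|inconsistencies"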
So how should these hbck inconsistencies be resolved? The natural idea is to use hbck's repair options. hbck provides different parameters for cases where an HBase region is stuck in RIT or cannot be brought online, and there are plenty of posts about hbck online, e.g. https://bbs.huaweicloud.com/blogs/353332.
For the problem I hit here, I tried just about all of hbck's repair parameters, including -repair, -fixMeta and -fixAssignments, but none of them worked.
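The attempts looked roughly like this (all limited to the problem table; none of them resolved the inconsistency):
sudo -u hbase hbase hbck -fixAssignments TRAF_RSRVD_1:TRAFODION._MD_.TEXT
sudo -u hbase hbase hbck -fixMeta TRAF_RSRVD_1:TRAFODION._MD_.TEXT
sudo -u hbase hbase hbck -repair TRAF_RSRVD_1:TRAFODION._MD_.TEXT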
So I took a closer look at hbck’s error message: Region {Meta => null, hdfs => hdfs://nameservice1/hbase/data/TRAF_RSRVD_1/TRAFODION._MD_.TEXT/xxx, deployed =>, replicaId => 0} was recently modified
ERROR: There is a hole in the region chain between and . You need to create a new .regioninfo and region dir in hdfs to plug the hole.
Many articles online say this error means that either the .regioninfo file or the meta information is missing. I checked the table's directory on HDFS and the .regioninfo file does exist, but there really is no corresponding entry in the meta table. That being the case, -fixMeta should in theory be able to rebuild the meta entry, but in practice running -fixMeta had no effect.
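The two checks can be done like this (a sketch; the encoded region name is taken from the region directory on HDFS, and the meta scan simply positions at the table's row keys):
hadoop fs -ls /hbase/data/TRAF_RSRVD_1/TRAFODION._MD_.TEXT/548bbceb2ada1359656b30939f1c7f0e/.regioninfo
echo "scan 'hbase:meta', {STARTROW => 'TRAF_RSRVD_1:TRAFODION._MD_.TEXT', LIMIT => 5}" | sudo -u hbase hbase shell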
So I took the region's encoded name from the path, i.e. 548bbceb2ada1359656b30939f1c7f0e, and tried to assign it to bring it online, but the assign command reported that the region did not exist.
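For reference, the assign attempt from the shell looks like this:
echo "assign '548bbceb2ada1359656b30939f1c7f0e'" | sudo -u hbase hbase shell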
So I suspected that the .regioninfo itself might be wrong, and simply moved the .regioninfo file away and then ran hbck -repair again to regenerate it:
hadoop fs -mkdir /user/hbase/548bbceb2ada1359656b30939f1c7f0e
hadoop fs -mv /hbase/data/TRAF_RSRVD_1/TRAFODION._MD_.TEXT/548bbceb2ada1359656b30939f1c7f0e/.regioninfo /user/hbase/548bbceb2ada1359656b30939f1c7f0e
sudo -u hbase hbase hbck -repair TRAF_RSRVD_1:TRAFODION._MD_.TEXT
After this sequence of operations, the inconsistency for this table was indeed resolved. Using the same method, running -repair on the remaining 24 objects one by one fixed them as well.
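Assuming the remaining objects are all tables as well, the per-table repair can be scripted with a simple loop (tables.txt is just a hypothetical file holding their names, one per line):
# repair each table listed in tables.txt
while read t; do
  sudo -u hbase hbase hbck -repair "$t"
done < tables.txt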
At this point the HBase startup problem was completely fixed. However, as for why that .regioninfo was problematic in the first place, what exactly was wrong with it, and why deleting it and then running repair worked, I haven't studied these details yet. For now I'm just recording the recovery process; I'll dig into it when I have time.