Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Sprint: 2023-2-Storage, 2023-3-Storage
Description
Test version:
master_0221_4dcd564
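Problem description: the "IoTDB-WAL-Recover" thread performs recovery very slowly (15 minutes). The symptoms are as follows:
CPU usage monitored for this thread:
jstack output of the datanode process: see attachment.
Thread corresponding to pid=1771:
Log from the high-CPU period; the log was only updated again after a 15-minute gap: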
Successfully recover WAL node in the directory /data/mpp_test/m_0221_4dcd564/sbin/../data/datanode/wal/root.test.g_1-11
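The thread lookup described above can be sketched as follows. This is a minimal illustration, assuming that 1771 is the per-thread id shown by `top -H` and that the JDK in use prints the `nid` field of its thread dumps in hexadecimal (some newer JDKs may print it in decimal). The helper name `tid_to_nid` and the dump file name are hypothetical.

```shell
#!/bin/sh
# Convert a decimal OS thread id (as shown by `top -H -p <pid>`) into the
# hexadecimal form used by the nid= field of a jstack thread dump.
# tid_to_nid is a hypothetical helper name, not part of IoTDB or the JDK.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Thread id taken from this report.
tid_to_nid 1771   # prints 0x6eb

# Hypothetical usage against a saved dump (the file name is an assumption):
# grep "nid=$(tid_to_nid 1771)" jstack_datanode.txt
```

If the matching entry is named "IoTDB-WAL-Recover", the high CPU usage can be attributed to the WAL recovery thread.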
Test procedure:
1. Start a 3-replica 3C3D cluster (3 ConfigNodes, 3 DataNodes); the data regions use the IoT protocol
3C: 192.168.10.72/73/74
3D: 192.168.10.72/73/74
ConfigNode env
MAX_HEAP_SIZE="8G"
DataNode env
MAX_HEAP_SIZE="256G"
MAX_DIRECT_MEMORY_SIZE="32G"
COMMON prop:
schema_replication_factor=3
data_replication_factor=3
enable_timed_flush_seq_memtable=true
seq_memtable_flush_interval_in_ms=600000
seq_memtable_flush_check_interval_in_ms=300000
enable_timed_flush_unseq_memtable=true
unseq_memtable_flush_interval_in_ms=600000
unseq_memtable_flush_check_interval_in_ms=300000
max_waiting_time_when_insert_blocked=3600000
query_timeout_threshold=3600000
2. Start the test script
The script exec_iotdb_4380.sh is on 192.168.10.71 under /home/liuzhen/benchmark/bm_v1; start_db.sh must also be present in the same path.
cat exec_iotdb_4380.sh
test_node="192.168.10.74"
sh iotdb.sh 4380 > 0215_4380_1.out &
sleep 15
ip74_dn_pid=`ssh liuzhen@192.168.10.74 "source /etc/profile;jps|grep -i datanode"`
v_pid=`echo ${ip74_dn_pid}|awk '{print $1}'`
echo ${v_pid}
ssh liuzhen@192.168.10.74 "kill -9 ${v_pid}"
ssh liuzhen@192.168.10.74 "sudo sh -c \"sync; echo 3 > /proc/sys/vm/drop_caches\""
wait
sh -x ./start_db.sh ${test_node} > startdb.log
sleep 3
sh iotdb.sh 4380 > 0215_4380_2.out
sleep 5
ssh liuzhen@192.168.10.74 "source /etc/profile;/data/mpp_test/m_0221_4dcd564/sbin/stop-datanode.sh"
sleep 15
ip74_dn_pid=`ssh liuzhen@192.168.10.74 "source /etc/profile;jps|grep -i datanode"`
v_pid=`echo ${ip74_dn_pid}|awk '{print $1}'`
echo ${v_pid}
ssh liuzhen@192.168.10.74 "kill -9 ${v_pid}"
sleep 3
ssh liuzhen@192.168.10.74 "cp -rp /data/mpp_test/m_0221_4dcd564/data /data/mpp_test/m_0221_4dcd564/data_for_recovery_Test"
sleep 3
ssh liuzhen@192.168.10.74 "sudo sh -c \"sync; echo 3 > /proc/sys/vm/drop_caches\""
sh -x ./start_db.sh ${test_node} >> startdb.log
sleep 3
sh iotdb.sh 4380 > 0215_4380_3.out
sleep 20
ssh liuzhen@192.168.10.74 "source /etc/profile;/data/mpp_test/m_0221_4dcd564/sbin/stop-datanode.sh"
sleep 10
ip74_dn_pid=`ssh liuzhen@192.168.10.74 "source /etc/profile;jps|grep -i datanode"`
v_pid=`echo ${ip74_dn_pid}|awk '{print $1}'`
echo ${v_pid}
ssh liuzhen@192.168.10.74 "kill -9 ${v_pid}"
sleep 30
ssh liuzhen@192.168.10.74 "cp -rp /data/mpp_test/m_0221_4dcd564/data /data/mpp_test/m_0221_4dcd564/data_for_recovery_Test_2"
sleep 3
ssh liuzhen@192.168.10.74 "sudo sh -c \"sync; echo 3 > /proc/sys/vm/drop_caches\""
sleep 2
sh -x ./start_db.sh ${test_node} >> startdb.log
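The script above repeats the same "find the DataNode pid, then kill -9" sequence three times. As a sketch only, that sequence could be factored into helpers like the ones below. The function names `extract_datanode_pid` and `kill_datanode` are hypothetical and not part of the original script; the user name and the `jps` call are taken from it.

```shell
#!/bin/sh
# Print the pid column of every `jps` output line whose process name
# contains "datanode" (case-insensitive). Hypothetical helper.
extract_datanode_pid() {
    printf '%s\n' "$1" | awk 'tolower($2) ~ /datanode/ {print $1}'
}

# Force-kill the DataNode on a remote host. Hypothetical wrapper; the ssh
# user and the profile sourcing mirror the original script.
kill_datanode() {
    host="$1"
    jps_out=$(ssh "liuzhen@${host}" "source /etc/profile; jps")
    pid=$(extract_datanode_pid "${jps_out}")
    # Skip the kill when no DataNode is running, rather than calling
    # `kill -9` with an empty argument.
    if [ -n "${pid}" ]; then
        ssh "liuzhen@${host}" "kill -9 ${pid}"
    fi
}

# Example (not executed here): kill_datanode "192.168.10.74"
```

Matching on the second column avoids the case where `grep -i datanode` returns nothing and `kill -9` is invoked without a pid.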
How to run the script:
nohup sh -x exec_iotdb_4380.sh > test.out &
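Check the script output in test.out.
The slow restart recovery occurs when the ip74 datanode is started for the second time in the script.
The benchmark run takes no more than 60 seconds.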