  1. Apache IoTDB
  2. IOTDB-5568

"IoTDB-WAL-Recover" thread recovers very slowly


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: mpp-cluster
    • Labels: None
    • Sprint: 2023-2-Storage, 2023-3-Storage

    Description

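      Test version:
      master_0221_4dcd564
      Problem description: the "IoTDB-WAL-Recover" thread performs recovery very slowly (15 minutes). The symptoms are as follows:

      CPU usage of this thread, as monitored:

      jstack output of the DataNode process: see the attachments.
      Thread corresponding to pid=1771:

      Log during the high-CPU period; the log is updated only after a 15-minute gap: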
      Successfully recover WAL node in the directory /data/mpp_test/m_0221_4dcd564/sbin/../data/datanode/wal/root.test.g_1-11
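
To locate the busy thread in a jstack dump, the decimal thread id has to be converted to hexadecimal, because jstack prints native thread ids as `nid=0x...`. The following is a minimal sketch, assuming that pid=1771 is the thread id shown by a per-thread view such as `top -H`; the function name `tid_to_nid` and the dump file name in the comment are hypothetical and do not come from this report.

```shell
# Convert a decimal thread id (e.g. from `top -H -p <datanode_pid>`)
# into the hexadecimal nid format that jstack prints.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

tid_to_nid 1771

# Possible follow-up (jstack_dump.txt is a hypothetical file name):
#   grep -A 20 "nid=$(tid_to_nid 1771) " jstack_dump.txt
```

The matching stack trace should then show what the recovery thread is doing during the high-CPU period.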

      Test procedure:
      1. Start a 3-replica 3C3D cluster (3 ConfigNodes, 3 DataNodes); the data region uses the IoT protocol
      3C: 192.168.10.72/73/74
      3D: 192.168.10.72/73/74
      ConfigNode env
      MAX_HEAP_SIZE="8G"

      DataNode env
      MAX_HEAP_SIZE="256G"
      MAX_DIRECT_MEMORY_SIZE="32G"

      COMMON prop:
      schema_replication_factor=3
      data_replication_factor=3
      enable_timed_flush_seq_memtable=true
      seq_memtable_flush_interval_in_ms=600000
      seq_memtable_flush_check_interval_in_ms=300000
      enable_timed_flush_unseq_memtable=true
      unseq_memtable_flush_interval_in_ms=600000
      unseq_memtable_flush_check_interval_in_ms=300000
      max_waiting_time_when_insert_blocked=3600000
      query_timeout_threshold=3600000
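
The properties above have to be in place on every node before the cluster starts. A minimal sketch of a helper that applies them, assuming a plain `key=value` properties file in which defaults may be commented out with `#`; the function name `set_prop` and the file path in the usage comment are hypothetical and not taken from this report.

```shell
# Set KEY=VALUE in a properties file.
# An existing (possibly commented-out) entry is rewritten in place;
# otherwise the entry is appended to the end of the file.
set_prop() {
  file=$1
  key=$2
  value=$3
  if grep -q "^[# ]*${key}=" "$file"; then
    sed "s|^[# ]*${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
      mv "${file}.tmp" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example usage (the path is an assumption):
#   conf=/data/mpp_test/m_0221_4dcd564/conf/iotdb-common.properties
#   set_prop "$conf" data_replication_factor 3
#   set_prop "$conf" seq_memtable_flush_interval_in_ms 600000
```

With these values, the timed flush checks run every 5 minutes (300000 ms) and flush memtables older than 10 minutes (600000 ms).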

      2. Start the test script
      Run the exec_iotdb_4380.sh script located in /home/liuzhen/benchmark/bm_v1 on 192.168.10.71; start_db.sh must also be present in the same directory.
      cat exec_iotdb_4380.sh
      test_node="192.168.10.74"
      sh iotdb.sh 4380 > 0215_4380_1.out &
      sleep 15
      ip74_dn_pid=`ssh liuzhen@192.168.10.74 "source /etc/profile;jps|grep -i datanode"`
      v_pid=`echo ${ip74_dn_pid}|awk '{print $1}'`
      echo ${v_pid}
      ssh liuzhen@192.168.10.74 "kill -9 ${v_pid}"
      ssh liuzhen@192.168.10.74 "sudo sh -c \"sync; echo 3 > /proc/sys/vm/drop_caches\""
      wait
      sh -x ./start_db.sh ${test_node} > startdb.log
      sleep 3
      sh iotdb.sh 4380 > 0215_4380_2.out
      sleep 5
      ssh liuzhen@192.168.10.74 "source /etc/profile;/data/mpp_test/m_0221_4dcd564/sbin/stop-datanode.sh"
      sleep 15
      ip74_dn_pid=`ssh liuzhen@192.168.10.74 "source /etc/profile;jps|grep -i datanode"`
      v_pid=`echo ${ip74_dn_pid}|awk '{print $1}'`
      echo ${v_pid}
      ssh liuzhen@192.168.10.74 "kill -9 ${v_pid}"

      sleep 3
      ssh liuzhen@192.168.10.74 "cp -rp /data/mpp_test/m_0221_4dcd564/data /data/mpp_test/m_0221_4dcd564/data_for_recovery_Test"
      sleep 3
      ssh liuzhen@192.168.10.74 "sudo sh -c \"sync; echo 3 > /proc/sys/vm/drop_caches\""
      sh -x ./start_db.sh ${test_node} >> startdb.log
      sleep 3
      sh iotdb.sh 4380 > 0215_4380_3.out
      sleep 20
      ssh liuzhen@192.168.10.74 "source /etc/profile;/data/mpp_test/m_0221_4dcd564/sbin/stop-datanode.sh"
      sleep 10
      ip74_dn_pid=`ssh liuzhen@192.168.10.74 "source /etc/profile;jps|grep -i datanode"`
      v_pid=`echo ${ip74_dn_pid}|awk '{print $1}'`
      echo ${v_pid}
      ssh liuzhen@192.168.10.74 "kill -9 ${v_pid}"

      sleep 30
      ssh liuzhen@192.168.10.74 "cp -rp /data/mpp_test/m_0221_4dcd564/data /data/mpp_test/m_0221_4dcd564/data_for_recovery_Test_2"
      sleep 3
      ssh liuzhen@192.168.10.74 "sudo sh -c \"sync; echo 3 > /proc/sys/vm/drop_caches\""
      sleep 2
      sh -x ./start_db.sh ${test_node} >> startdb.log

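      How to run the script:
      nohup sh -x exec_iotdb_4380.sh > test.out &
      Check the script output in test.out.
      The slow restart recovery occurs the second time the script starts the DataNode on ip74 (192.168.10.74).
      Each benchmark run takes no more than 60 seconds.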
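
The script repeats the same "find the DataNode pid with jps, then kill -9 it" sequence three times. If stop-datanode.sh has already stopped the process, the pid is empty and `kill -9` is called without an argument. The sequence could be factored into a helper along the following lines. This is only a sketch: the function names `pid_from_jps` and `kill_datanode` are hypothetical, and the ssh command mirrors the one used in the script.

```shell
# Extract the pid (first field) from a jps line such as "12345 DataNode".
pid_from_jps() {
  printf '%s\n' "$1" | awk '{print $1}'
}

# Force-kill the DataNode on a remote host, if one is still running.
# usage: kill_datanode user@host
kill_datanode() {
  host=$1
  line=$(ssh "$host" "source /etc/profile; jps | grep -i datanode")
  pid=$(pid_from_jps "$line")
  # Nothing to do when the DataNode has already exited.
  [ -n "$pid" ] || return 0
  ssh "$host" "kill -9 $pid"
}

# Example usage:
#   kill_datanode liuzhen@192.168.10.74
```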

      Attachments

        1. dn.out
          63 kB
          刘珍
        2. exec_iotdb_4380.sh
          2 kB
          刘珍
        3. image-2023-02-21-17-02-08-057.png
          99 kB
          刘珍
        4. image-2023-02-21-17-02-23-147.png
          32 kB
          刘珍
        5. image-2023-02-21-17-03-43-288.png
          13 kB
          刘珍
        6. image-2023-02-21-17-04-20-970.png
          99 kB
          刘珍
        7. iotdb_4380.conf
          14 kB
          刘珍
        8. ip74_dn_recovery_slow_logs.tar.gz
          123 kB
          刘珍
        9. start_db.sh
          0.6 kB
          刘珍

          People

            Assignee: Haiming Zhu (HeimingZ)
            Reporter: 刘珍