  1. Kylin
  2. KYLIN-3845

Kylin streaming build fails if the Kafka data source lacks selected dimensions or metrics

Details

    Description

      Hi dear team,
      I'm developing an OLAP platform based on Kylin 2.5.2. As part of this work, I built a streaming cube from a Kafka source using the Kafka demo.
      In my streaming project, I set country and currency as dimensions and userId as a metric. But the cube build failed in the 3rd step ("Extract Fact Table Distinct Columns") with java.lang.ArrayIndexOutOfBoundsException.
      Here are the logs:
      2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do cleanup, available memory: 1334m
      2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Total rows: 127
      2019-03-02 14:21:01,492 INFO [main] org.apache.hadoop.mapred.MapTask: Finished spill 0
      2019-03-02 14:21:01,492 INFO [main] org.apache.hadoop.mapred.YarnChild: Exception running child: java.lang.ArrayIndexOutOfBoundsException:2
      2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do cleanup, available memory: 1334m
      at org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper.doMap(FactDistinctColumnsMapper.java:177)
      at org.apache.kylin.engine.mr.KylinMapper.map(KylinMapper.java:77)
      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:187)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:180)

      Then I found that some streaming records in the Kafka data source lack the userId column. Most records (country, currency, userId) look like ("China","CNY","843c4d"), but a small number lack userId and look like ("China","CNY"). So when the 3rd step ("Extract Fact Table Distinct Columns") runs, the MR engine throws the exception for any record that lacks userId.

      Then I checked the Kylin source, FactDistinctColumnsMapper.java:

      public void doMap(KEYIN key, Object record, Context context) throws IOException, InterruptedException {
          Collection<String[]> rowCollection = flatTableInputFormat.parseMapperInput(record);

          for (String[] row : rowCollection) {
              context.getCounter(RawDataCounter.BYTES).increment(countSizeInBytes(row));
              for (int i = 0; i < allCols.size(); i++) {
                  String fieldValue = row[columnIndex[i]];
                  if (fieldValue == null)
                      continue;

                  final DataType type = allCols.get(i).getType();
      ...

      I found that columnIndex[i] equals the length of row when a streaming record lacks one column, so row[columnIndex[i]] throws the ArrayIndexOutOfBoundsException. I changed the code to compare columnIndex[i] with the length of row: if columnIndex[i] is equal to or larger than the length of row, I set fieldValue to an empty value. After this change, the 3rd step ("Extract Fact Table Distinct Columns") runs successfully.
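
      The check described above can be sketched as a standalone class. This is only an illustration under assumptions, not the actual Kylin patch: the class name SafeFieldAccess and the helpers fieldOrEmpty and extractFields are hypothetical, and using an empty string as the "empty value" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration; none of these names exist in Kylin.
class SafeFieldAccess {

    // Returns row[index], or an empty string when the row is too short,
    // i.e. when the streaming record lacks that column.
    static String fieldOrEmpty(String[] row, int index) {
        if (index < 0 || index >= row.length) {
            return ""; // assumption: a missing column is treated as empty
        }
        return row[index];
    }

    // Mirrors the inner loop of doMap: reads every mapped column of one row
    // without ever indexing past the end of the array.
    static List<String> extractFields(String[] row, int[] columnIndex) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < columnIndex.length; i++) {
            String fieldValue = fieldOrEmpty(row, columnIndex[i]);
            if (fieldValue == null) {
                continue; // same null-skip as in the original loop
            }
            values.add(fieldValue);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] columnIndex = {0, 1, 2}; // country, currency, userId
        // Complete record: all three columns are present.
        System.out.println(extractFields(new String[] {"China", "CNY", "843c4d"}, columnIndex));
        // Short record: userId is missing, so index 2 is out of range.
        System.out.println(extractFields(new String[] {"China", "CNY"}, columnIndex));
    }
}
```

      With the unguarded row[columnIndex[i]], the second call would fail at index 2, which matches the "ArrayIndexOutOfBoundsException:2" in the log. Whether a missing column should become an empty string or null (and therefore be skipped) is a design choice for the maintainers.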

      This is what I found; it can cause problems for other developers.
      What do you think?

      Best regards,
      jintao

            People

              zhao jintao

                Time Tracking

                  Original Estimate: 0.5h
                  Remaining Estimate: 0.5h
                  Time Spent: Not Specified