Changed String.substring() Behavior

One of my tools was running extremely slow since I ported it to run with my new sources, and I always wondered why.

Time for a Night Shift

One of the useful tools I’m using in my DWG/DXF viewer project is a categorizer. It runs over the complete bunch of my test files and categorizes them in several categories like files containing DIMENSION entities or files of a given version or files which write warnings during read. It then creates a directory tree of the categories and their values and sets symbolic links to matching files. This allows to easily access files which provide a special feature for testing.

In the code base related to the predecessor of my viewer categorizing all files took some 30 minutes. That was DXF only, so when running the categorizer in my new source base I expected some duplication because now it hat to handle DWG, too. But instead categorization took 20 times longer: 10h. As I was too busy to look into that more thoroughly I only played around with the underlying thread pool but that was only helping to avoid out of memory problems. So in the last months categorization was something I preferably started in the evening when finishing work so it was done during the night.

Nag Nag Nag

I’m always unhappy if I don’t understand something, and so I had a continuing nagging feeling regarding this issue, and today I eventually had a look what is causing the problem. I had already added some output code for the categorizer which showed me in which directory it was hanging a while ago, but that wasn’t really helpful as the files for which it took ages where not really special besides being quite large and importing lots of external images. But I have much larger files, and had already switched off image loading during categorization.

The problem was easily identified (which obviously means I should have looked earlier): it’s the substring() method of java.lang.toString(). One of my categorizations was looking for exceptions when writing the file back as DXF. And the offending files contained large binary arrays, which are written as hexcode strings to DXF, cut into lines of 254 chars. My code was converting the byte array into a hexcode string, then cutting out the part for the successive lines with substring() which looked like

  public void write(int code, @NotNull byte[] value) throws AcException
  {
    String str = DxfByteArrayGroup.toString(value);
    if (str.length() > MAX_LINE_LENGTH) {
      // split
      while (str.length() > MAX_LINE_LENGTH) {
        writeString((short)code, str.substring(0, MAX_LINE_LENGTH));
        str = str.substring(MAX_LINE_LENGTH);
      }
      if (!str.isEmpty()) {
        writeString((short)code, str);
      }
    }
    else {
      writeString((short)code, str);
    }
  }

But one of the changes between the old and the new version was a switch of the Java version from 1.6 to 1.7 (some of my customers are quite conservative). Between these versions the internal semantics of substring() changed. In 1.6 and before the underlying char array was shared between a String and the Strings returned by substring(). Since 1.7 each string has its own copy of the given array. This was done as a fix for Bug-ID 6294060 where in combination with StringBuffer there were a memory leak. To my feeling the fix is a bit overdoing it.

So before the change substring() was running in constant time, but now it needs linear time. The byte array which was written was some 25MBytes large, resulting in a 50MChar string. So the second call to substring() in my while loop was copying in average half of this string some 200,000 times. That is some 10TBytes of copying just for outputting one byte array. Ouch!

In a first step I changed my code to cut out only the small part which is really written from the string:

  public void write(int code, @NotNull byte[] value) throws AcException
  {
    final String str = DxfByteArrayGroup.toString(value);
    final int len = str.length();
    if (len > MAX_LINE_LENGTH) {
      // split
      final short shortCode = (short)code;
      final int numFullGroups = len / MAX_LINE_LENGTH;
      final int rest = len % MAX_LINE_LENGTH;
      for (int g = 0;  g < numFullGroups; ++g) {
        final int pos = g * MAX_LINE_LENGTH;
        writeString(shortCode, str.substring(pos, pos + MAX_LINE_LENGTH));
      }
      if (rest != 0) {
        writeString(shortCode, str.substring(numFullGroups * MAX_LINE_LENGTH));
      }
    }
    else {
      writeString((short)code, str);
    }
  }

But knowing that such extremely large byte arrays can appear I decided to completely rewrite the method, so that it only handles the bytes really written for each line, and reuses a StringBuilder:

  public void write(int code, @NotNull byte[] value) throws AcException
  {
    final short shortCode = (short)code;
    final int len = value.length;
    if (len == 0) {  // handle special case
      writeString(shortCode, Empty.STRING);
      return;
    }
    final StringBuilder sb = new StringBuilder(MAX_LINE_LENGTH);
    for (int i = 0;  i < len;  ++i) {
      sb.append(DxfByteArrayGroup.byteToString(value[i]));
      if (i % MAX_BYTES_PER_LINE == MAX_BYTES_PER_LINE - 1) {
        // reached end of line, so write it out
        writeString(shortCode, sb.toString());
        sb.setLength(0); // clear buffer
      }
    }
    if (sb.length() > 0) {
      // have a rest 
      writeString(shortCode, sb.toString());
    }
  }

The results were convincing:

Method Duration of Categorization
old with recursive substring() (3 threads) 36000 s
intermediate with cutout substring() (3 threads) 3179 s
final with StringBuilder reusage (6 threads) 2240 s

As expected the last improvement is not really showing in the categorization, as the writing is only a small part of what it is doing, and I doubled the threads which run in parallel. I had reduced them in the past because I ran into OutOfMemory problems. This is no wonder because all 3 tasks were hanging with basically the same file: the original DWG, a DXF created from it by AutoCAD and a DXF created by another tool.

So now I’m back to a state where I can easily add a new category and rerun the categorization when I need to find files with a special feature. Some 40 minutes is okay for a short break or doing something else.