Dealing with Large Numbers of Fields

In the past, we encountered a Spark bug in which a Row had so many columns that the code generated for Row equality and ordering exceeded the JVM method size limit.

We have also occasionally generated JVM methods too large for the JIT to compile. Such methods run in the interpreter, which causes a severe performance loss.

Most recently, we introduced a Hail-specific “Method Code Too Large” bug when a user attempted to rename fields of a table with thousands of fields. The rename used a new code path that triggered IR generation, and the generated IR literally copied the struct with no modifications. Longer term, we would like rename to be a no-op. However, this revealed a more fundamental issue: annotate, transmute, et al. will all fail on tables with large numbers of columns, because each of these methods generates a struct with thousands of fields, and the bytecode to construct that struct necessarily exceeds the JVM method code size limit.

Short Term Fix

Amanda added a safeguard in https://github.com/hail-is/hail/pull/3233#event-1542109178 that uses the annotation path for tables with more than 500 fields.

This buys us time until we are forced to remove the AST.

Options for Long Term Fixes

Fewer JVM Bytecodes

In certain circumstances, we can avoid generating bytecode linear in the number of fields. For example, if a struct is copied without modification, we can memcpy its bytes. If only one field of a struct is modified, we can memcpy the bytes to the left and right of that field, as long as the new missing bits don’t change the layout. It’s unclear how far we can get with this strategy alone.
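A minimal sketch of the idea, using a byte array to stand in for off-heap region memory; the offsets, sizes, and helper names are illustrative, not Hail’s actual Region API:

```scala
object CopyStruct {
  // What staged code looks like today: one copy per field, so the emitted
  // bytecode grows linearly with the field count. Offsets and sizes are
  // hypothetical; real layouts also carry missing bits.
  def copyFieldByField(src: Array[Byte], dst: Array[Byte]): Unit = {
    System.arraycopy(src, 0, dst, 0, 8)    // field 0: Int64 at offset 0
    System.arraycopy(src, 8, dst, 8, 8)    // field 1: Float64 at offset 8
    System.arraycopy(src, 16, dst, 16, 4)  // field 2: Int32 at offset 16
    // ... with thousands of fields, thousands more such statements
  }

  // The memcpy alternative: if nothing is modified and the missing bits
  // leave the layout unchanged, one bulk copy suffices, and its bytecode
  // size is independent of the field count.
  def copyWholeStruct(src: Array[Byte], dst: Array[Byte], structSize: Int): Unit =
    System.arraycopy(src, 0, dst, 0, structSize)

  // If exactly one fixed-size field changes, copy the bytes to its left and
  // right and write only the new value.
  def copyWithOneFieldChanged(src: Array[Byte], dst: Array[Byte], structSize: Int,
                              fieldOffset: Int, newValue: Array[Byte]): Unit = {
    System.arraycopy(src, 0, dst, 0, fieldOffset)
    System.arraycopy(newValue, 0, dst, fieldOffset, newValue.length)
    val after = fieldOffset + newValue.length
    System.arraycopy(src, after, dst, after, structSize - after)
  }
}
```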

Many JVM Methods

Using class variables to store StagedRegionValueBuilder state, we can break the generated JVM code into smaller methods that are individually below the JVM method size limit. Consider, for example, generating a method for each field of the top-level struct, or for each group of fields. The JVM class file spec uses two bytes to store the number of methods, so we certainly cannot exceed 65535 methods. I do not know whether any lower limit on method count exists in practice.
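A hand-written analogue of what the generated class might look like; all names are hypothetical, and the loops stand in for what would really be straight-line generated stores. The point is that builder state lives in class fields, so the work can be split across many small methods called from a small driver:

```scala
final class BuildWideStruct {
  // builder state promoted from JVM locals to class fields, so it is
  // visible across the split methods
  private var region: Array[Byte] = _
  private var structOffset: Int = _

  // big-endian store of one Int64 field into the region
  private def storeLong(off: Int, v: Long): Unit = {
    var i = 0
    while (i < 8) {
      region(structOffset + off + i) = (v >>> (56 - 8 * i)).toByte
      i += 1
    }
  }

  // fields 0..499 (real generated code would be straight-line, no loop)
  private def addFields_0(row: Array[Long]): Unit = {
    var i = 0
    while (i < 500) { storeLong(8 * i, row(i)); i += 1 }
  }

  // fields 500..999
  private def addFields_1(row: Array[Long]): Unit = {
    var i = 500
    while (i < 1000) { storeLong(8 * i, row(i)); i += 1 }
  }

  // the driver stays small: one call per group of fields
  def apply(r: Array[Byte], off: Int, row: Array[Long]): Unit = {
    region = r; structOffset = off
    addFields_0(row)
    addFields_1(row)
  }
}
```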

Generate x86 Assembly

This is viable long term, but we will probably remove AST before we can implement this approach.

I agree that the long-term solution is to generate (a language that compiles to) native code, which has no such limits. The JVM really wasn’t built to be a code-generator target. Also, at the rate we’re going, I agree the AST will be gone before we make it to C++ codegen.

So what do we do? I suggest the following:

  • Store arguments and bindings (Let, ArrayMap, etc.) in class members.
  • There are a small number of IR nodes that can have a large number of children, mostly constructors for aggregates: MakeArray, MakeStruct, InsertFields, MakeTuple, etc. These routines should emit their children in a subfunction (perhaps only when the number of children is sufficiently large). We should keep rough track of the bytecode size being emitted and generate multiple subfunctions if necessary; maybe count the number of IR nodes that will go into the function and apply a conservative conversion factor from IR nodes to bytecodes (see the sketch after this list). The target should of course be below the JVM huge-method compilation limit (8K bytecodes?), which is far below the 64K bytecode size limit, so this is likely to be pretty safe.
  • This assumes that the size of the code, modulo large aggregate constructions, is relatively small. I think that is often, but not always, the case. I would save this for a second pass, but I think we can apply the same idea to the top-level code: estimate bytecode size, break the code into pieces, and generate each piece in a separate (but maximally large, <= 8K bytecode) function.
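A rough sketch of the size-driven splitting, under the assumption that we can attach a conservative bytecode estimate to each IR node; the names and the per-node cost model are invented for illustration:

```scala
import scala.collection.mutable

object SplitEmitter {
  // Target HotSpot's huge-method compile threshold (~8000 bytecodes),
  // not just the hard 64KB method limit.
  val TargetBytecodes = 8000

  // stand-in for an IR node carrying a conservative bytecode estimate
  final case class Node(name: String, estimatedBytecodes: Int)

  // Greedily pack children into groups whose estimated size stays at or
  // below the target; each group would be emitted as its own subfunction.
  def partition(children: Seq[Node]): Seq[Seq[Node]] = {
    val groups = mutable.ArrayBuffer(mutable.ArrayBuffer.empty[Node])
    var size = 0
    for (c <- children) {
      if (size + c.estimatedBytecodes > TargetBytecodes && groups.last.nonEmpty) {
        groups += mutable.ArrayBuffer.empty[Node]  // spill into a new subfunction
        size = 0
      }
      groups.last += c
      size += c.estimatedBytecodes
    }
    groups.map(_.toSeq).toSeq
  }

  def main(args: Array[String]): Unit = {
    // e.g. a MakeStruct with 3000 fields at ~20 estimated bytecodes each
    val fields = (0 until 3000).map(i => Node(s"field$i", 20))
    val groups = partition(fields)
    println(s"${groups.length} subfunctions, sizes: " +
      groups.map(_.map(_.estimatedBytecodes).sum).mkString(", "))
  }
}
```

With greedy packing, even a pathological MakeStruct degrades into a linear number of small, JIT-friendly methods rather than one uncompilable giant.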