Currently, define_function is only implemented in the Spark backend, where it calls into the JVM to set a piece of state on a static object to define the function. I propose that, in order to support the client/server execution model, we redesign define_function to record the function mapping only on the Python client, and change the format of the message communicated to the server to include the registered functions.
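A minimal sketch of what client-side-only registration might look like (all names here are hypothetical and the real define_function signature may differ):

```python
# Hypothetical sketch: define_function records the mapping in client-side
# state instead of calling into the JVM; nothing is sent until execution.

_registered_functions = []  # client-side registry of user-defined functions

def define_function(f, param_types, ret_type, mname, type_args=()):
    # 'f' is assumed here to already be the serialized IR body (a string).
    _registered_functions.append({
        'param_types': [str(t) for t in param_types],
        'ret_type': str(ret_type),
        'mname': mname,
        'type_args': [str(t) for t in type_args],
        'f': f,
    })
```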
Currently, the message sent to the server for execution is a single IR string. We can change this message to a JSON mapping with two fields (for now; reference genomes could be sent similarly later): functions and ir. ir will be a string, and functions will be a list of mappings with the same fields as the Python function object: param_types, ret_type, mname, type_args, and f (str).
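For concreteness, the execute payload could look roughly like the sketch below (the IR strings are placeholders, not exact Hail IR text syntax):

```python
import json

# Hypothetical payload for the execute message: the registered functions plus
# the IR to evaluate. Field names follow the proposal above.
payload = {
    'functions': [
        {
            'param_types': ['int32'],
            'ret_type': 'int32',
            'mname': 'add_one',
            'type_args': [],
            'f': '<serialized IR for the function body>',
        }
    ],
    'ir': '<serialized IR for the query, which may call add_one>',
}

request_body = json.dumps(payload)
```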
Instead of changing the JSON as you describe (which is also a fine solution), I was imagining an “IR Module” which contains a set of “IR Functions” along with either (a) a special “main” or “apply” function, which is the main function to invoke for the pipeline, or (b) a designated IR which is the value of the module. The module should be what the query optimizer operates on.
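To make the alternative concrete, here is a rough sketch of what an IR Module might hold (names are hypothetical, and this ignores how the existing IR classes would be threaded through):

```python
# Hypothetical sketch of an IR Module: a set of named IR functions plus either
# a designated "main" entry point or a top-level IR whose value is the
# module's result. The optimizer would take the whole module as its unit of work.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IRFunction:
    mname: str
    param_types: List[str]
    ret_type: str
    body: str  # IR for the function body

@dataclass
class IRModule:
    functions: List[IRFunction] = field(default_factory=list)
    main: Optional[str] = None   # (a) name of the entry-point function, or
    body: Optional[str] = None   # (b) a designated IR that is the module's value
```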
Something to keep in mind, although we don’t need to do this now: it would be nice for users to be able to register (1) user-defined functions, and (2) entire pipelines (which take arguments) that are pre-compiled for low-latency requests. The problem with user-defined functions is that we can’t compile them beforehand without knowing the physical types they’ll be instantiated with. The pre-compiled pipelines are what I imagine the gnomAD browser would want to use.