On interval endpoints

I’d like to make Interval more general by adding the option to specify whether each endpoint is included or excluded. (I want this because I need different kinds of intervals in the partitioner, but I think it’d be nice to have more generally.)

TInterval would have three fields (itype, start, and end), where itype is an Int32 that packs the two bits of information (startInclusive and endInclusive). Interval.contains, Interval.overlaps, and Interval.isEmpty would need to account for all the possibilities.
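To make the proposal concrete, here is a minimal Python sketch of an interval whose itype packs the two inclusivity bits. The bit layout, the snake_case names, and the default of [start, end) are assumptions for illustration, not the actual TInterval implementation. It also assumes a dense ordering, so (1, 2) is not treated as empty even though it contains no integers.

```python
from dataclasses import dataclass

# Hypothetical bit layout for itype; the real encoding may differ.
START_INCLUSIVE = 0b01
END_INCLUSIVE = 0b10


@dataclass(frozen=True)
class Interval:
    start: int
    end: int
    itype: int = START_INCLUSIVE  # defaults to [start, end)

    @property
    def start_inclusive(self) -> bool:
        return bool(self.itype & START_INCLUSIVE)

    @property
    def end_inclusive(self) -> bool:
        return bool(self.itype & END_INCLUSIVE)

    def contains(self, x: int) -> bool:
        # Each endpoint uses a strict or non-strict comparison depending on its bit.
        above = x >= self.start if self.start_inclusive else x > self.start
        below = x <= self.end if self.end_inclusive else x < self.end
        return above and below

    def is_empty(self) -> bool:
        # Under a dense ordering, only a reversed interval or a single point
        # with at least one excluded endpoint is empty.
        if self.start > self.end:
            return True
        if self.start == self.end:
            return not (self.start_inclusive and self.end_inclusive)
        return False

    def _starts_before_end_of(self, other: "Interval") -> bool:
        # True if self's start does not lie past other's end.
        if self.start < other.end:
            return True
        return (
            self.start == other.end
            and self.start_inclusive
            and other.end_inclusive
        )

    def overlaps(self, other: "Interval") -> bool:
        if self.is_empty() or other.is_empty():
            return False
        return self._starts_before_end_of(other) and other._starts_before_end_of(self)
```

With this sketch, [1, 5) and [5, 10) do not overlap, while [1, 5] and [5, 10) do, because only in the second case is the shared point 5 included on both sides.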

Seems cool to me. Current intervals are inclusive of both, right?

In scala / RV land, I’d use two actual boolean fields for programmer-sanity. Two booleans in RV-land should be pretty efficient anyway (I think 1 byte each).

Current intervals are [a, b). I forgot we had booleans that were pretty small—that seems pretty good.

So one thing that I ran into while propagating up into user-visible-land:

  • I updated the toString representation (and the parser) to recognize ("[", "(") and ("]", ")") around an interval, and interpret accordingly. The JSON representation just stores the booleans as additional fields.
  • How much of this do we want to expose to users? Right now, the parser recognizes an interval written as start-end both with and without enclosing brackets; without brackets it defaults to [a, b). That keeps compatibility with the python-side Interval.parse as it currently works (I don’t believe the unbracketed form is needed anywhere else). I think parsing something of the form [a, b) etc. would also work there, so maybe it would be cleaner to require that from now on? I was going to document and expose this in python in a future pull request, if we want that to happen.
  • I fixed up the IntervalList parser to take advantage of the new stuff—instead of taking end+1 as an exclusive endpoint, it just takes end as an inclusive one.
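As a rough illustration of the parsing behavior described in the bullets above, here is a small Python sketch. It is not the actual parser: the names parse_interval_text, ParsedInterval, and interval_list_entry are hypothetical, endpoints are restricted to non-negative integers, and the bracketed form accepts either "," or "-" as the separator, which is an assumption.

```python
import re
from typing import NamedTuple


class ParsedInterval(NamedTuple):
    start: int
    end: int
    start_inclusive: bool
    end_inclusive: bool


# Bracketed form, e.g. "[1, 5)" or "(1-5]".
_BRACKETED = re.compile(r"^\s*([\[(])\s*(\d+)\s*[-,]\s*(\d+)\s*([\])])\s*$")
# Bare form, e.g. "1-5", kept for backwards compatibility.
_BARE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*$")


def parse_interval_text(text: str) -> ParsedInterval:
    m = _BRACKETED.match(text)
    if m:
        open_, start, end, close = m.groups()
        # "[" and "]" mark an included endpoint, "(" and ")" an excluded one.
        return ParsedInterval(int(start), int(end), open_ == "[", close == "]")
    m = _BARE.match(text)
    if m:
        # No brackets: fall back to the default [a, b).
        return ParsedInterval(int(m.group(1)), int(m.group(2)), True, False)
    raise ValueError(f"cannot parse interval: {text!r}")


def interval_list_entry(start: int, end: int) -> ParsedInterval:
    # Interval-list style input: keep `end` as an inclusive endpoint
    # instead of rewriting it to the exclusive endpoint end + 1.
    return ParsedInterval(start, end, True, True)
```

For integer positions, the inclusive entry [start, end] covers the same points as the old [start, end + 1) encoding; the difference is only that the stored endpoint now matches what was written in the file.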

Are there other things that I might be missing?

I like [l,r) as a syntax, personally. I think some of our users will always prefer l-r, even though it’s kind of ambiguous.

I feel like @tpoterba is going to lament the genetics usability hit of this suggestion, but here goes:

  • we can do a good job on input: support [a, b) syntax as well as 2:57-258 for intervals,
  • parse_interval should support the above formats, but shouldn’t be genetics-specific,
  • consistently use the precise and unambiguous [a, b) syntax on output,
  • add genetics_interval_str(interval) -> str (name? ugh.) that converts interval to the 2:57-258 syntax.
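A sketch of what the two output paths in the list above might look like, in Python. Everything here is an assumption for illustration: the LocusInterval type, the single-contig restriction, and the choice to read 2:57-258 as the default [57, 258) convention on integer positions. The ambiguity of l-r is exactly why that last assumption has to be made somewhere.

```python
from typing import NamedTuple


class LocusInterval(NamedTuple):
    # Hypothetical single-contig interval over integer positions.
    contig: str
    start: int
    end: int
    start_inclusive: bool = True
    end_inclusive: bool = False


def interval_str(iv: LocusInterval) -> str:
    # Precise output: the brackets say exactly which endpoints are included.
    left = "[" if iv.start_inclusive else "("
    right = "]" if iv.end_inclusive else ")"
    return f"{left}{iv.contig}:{iv.start}, {iv.contig}:{iv.end}{right}"


def genetics_interval_str(iv: LocusInterval) -> str:
    # Normalize to [start, end) first (assumed meaning of "contig:start-end"),
    # shifting integer endpoints by one where the inclusivity differs.
    start = iv.start if iv.start_inclusive else iv.start + 1
    end = iv.end + 1 if iv.end_inclusive else iv.end
    return f"{iv.contig}:{start}-{end}"
```

Under these assumptions, [2:57, 2:258) and (2:56, 2:257] both print as 2:57-258 in the genetics form, while interval_str keeps them distinct.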

Thoughts?
