On interval endpoints


#1

I’d like to make Interval more general by adding the option to specify whether each endpoint is included or excluded. (I need different kinds of intervals in the partitioner, but I think it would be nice to have these more generally.)

TInterval would have 3 fields (itype, start, and end), where itype is an Int32 holding the two bits of information (startInclusive and endInclusive). Interval.contains, Interval.overlaps, and Interval.isEmpty would need to account for all the possibilities.
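
To make "all the possibilities" concrete, here is a minimal Python sketch of the idea. The bit layout of itype, the class shape, and the `_before` helper are assumptions for illustration only, not the actual TInterval code. It treats the domain as a plain ordered type, so it does not try to detect discrete cases such as (1, 2) being empty over the integers.

```python
from dataclasses import dataclass

# Hypothetical bit layout for itype; the proposal does not specify one.
START_INCLUSIVE = 0b01
END_INCLUSIVE = 0b10


def _before(s, s_incl, e, e_incl):
    """True if a start bound s and an end bound e enclose at least one point."""
    return s < e or (s == e and s_incl and e_incl)


@dataclass(frozen=True)
class Interval:
    start: int
    end: int
    itype: int  # two flag bits packed into one integer

    @property
    def includes_start(self):
        return bool(self.itype & START_INCLUSIVE)

    @property
    def includes_end(self):
        return bool(self.itype & END_INCLUSIVE)

    def is_empty(self):
        # With start == end, only [a, a] contains a point.
        return not _before(self.start, self.includes_start,
                           self.end, self.includes_end)

    def contains(self, x):
        above = x > self.start or (self.includes_start and x == self.start)
        below = x < self.end or (self.includes_end and x == self.end)
        return above and below

    def overlaps(self, other):
        # Two non-empty intervals overlap when each one's start
        # comes "before" the other's end, respecting inclusivity.
        return (not self.is_empty() and not other.is_empty()
                and _before(self.start, self.includes_start,
                            other.end, other.includes_end)
                and _before(other.start, other.includes_start,
                            self.end, self.includes_end))
```

With this sketch, [1, 5) and [5, 9) do not overlap, while [1, 5] and [5, 9) do, because both intervals include the shared point 5.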


#2

Seems cool to me. Current intervals are inclusive of both, right?

In Scala / RV land, I’d use two actual boolean fields for programmer-sanity. Two booleans in RV-land should be pretty efficient anyway (I think 8 bits each).


#3

Current intervals are [a, b). I forgot we had booleans that were pretty small—that seems pretty good.


#4

So one thing that I ran into while propagating up into user-visible-land:

  • I updated the toString representation (and the parser) to recognize ("[", "(") and ("]", ")") around an interval, and interpret accordingly. The JSON representation just stores the booleans as additional fields.
  • How much of this do we want to expose to users? Right now, the parser recognizes an interval written as start-end both with and without enclosing brackets (without brackets, it defaults to [a, b)). This maintains compatibility with the python-side Interval.parse as it currently works (I don’t believe the bracketless version is needed anywhere else), but parsing something of the form [a, b) etc. would also work, and maybe it would be cleaner to require that from now on? I was going to document and expose this in Python in a future pull request, if we want that to happen.
  • I fixed up the IntervalList parser to take advantage of the new stuff—instead of taking end+1 as an exclusive endpoint, it just takes end as an inclusive one.
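
The parsing behaviour described in the bullets above can be sketched roughly as follows. This is a minimal illustration over plain integer endpoints, not the actual parser: `ParsedInterval`, `parse_interval`, `interval_list_entry`, and the regular expression are all hypothetical names and choices made for this example.

```python
import re
from typing import NamedTuple


class ParsedInterval(NamedTuple):
    start: int
    end: int
    includes_start: bool
    includes_end: bool


# Optional opening bracket, start, '-' or ',', end, optional closing bracket.
_PATTERN = re.compile(r"^\s*([\[(])?\s*(\d+)\s*[-,]\s*(\d+)\s*([\])])?\s*$")


def parse_interval(text):
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError(f"cannot parse interval: {text!r}")
    open_b, start, end, close_b = m.groups()
    if (open_b is None) != (close_b is None):
        raise ValueError(f"unbalanced brackets in interval: {text!r}")
    if open_b is None:
        # Bare start-end keeps the existing default, [a, b).
        return ParsedInterval(int(start), int(end), True, False)
    return ParsedInterval(int(start), int(end), open_b == "[", close_b == "]")


def interval_list_entry(start, end):
    # Interval-list style input has an inclusive end, so it is stored as
    # [start, end] directly rather than rewritten as [start, end + 1).
    return ParsedInterval(start, end, True, True)
```

Here `parse_interval("57-258")` and `parse_interval("[57, 258)")` give the same result, while `parse_interval("(57-258]")` flips both flags.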

Are there other things that I might be missing?


#5

I like [l,r) as a syntax, personally. I think some of our users will always prefer l-r, even though it’s kind of ambiguous.


#6

I feel like @tpoterba is going to lament the genetics usability hit of this suggestion, but here goes:

  • we can do a good job on input. Support [a, b) syntax as well as 2:57-258 for intervals,
  • parse_interval should support the above formats, but shouldn’t be genetics-specific,
  • consistently use the precise and unambiguous [a, b) syntax on output,
  • add genetics_interval_str(interval) -> str (name? ugh.) that converts interval to the 2:57-258 syntax.
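
As a rough sketch of the output side of this list: the `LocusInterval` class, its `contig` field, and the exact bracketed layout are assumptions for illustration. The conversion also assumes integer positions and that the bare 2:57-258 form means [57, 258), which is the default mentioned earlier in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocusInterval:
    # Hypothetical shape; the real interval type may look different.
    contig: str
    start: int
    end: int
    includes_start: bool = True
    includes_end: bool = False


def interval_str(iv):
    """Unambiguous form: brackets record which endpoints are included."""
    left = "[" if iv.includes_start else "("
    right = "]" if iv.includes_end else ")"
    return f"{left}{iv.contig}:{iv.start}-{iv.end}{right}"


def genetics_interval_str(iv):
    """Bracketless form, normalized to the assumed [start, end) reading."""
    # Integer positions let us shift an excluded start or an included end
    # by one to get the equivalent half-open interval.
    start = iv.start if iv.includes_start else iv.start + 1
    end = iv.end + 1 if iv.includes_end else iv.end
    return f"{iv.contig}:{start}-{end}"
```

Under these assumptions, the closed interval from 57 to 257 on contig 2 prints as [2:57-257] in the precise form and as 2:57-258 in the genetics form.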

Thoughts?