0.2 beta checklist

This thread is to discuss the goals for 0.2 beta. These should be the main development priority.

  1. A stable file format. I will create a separate thread to discuss the format, the changes currently planned, and ways to support forward-compatible changes. The format should not be significantly larger than 0.1's or slower to read/write. We can argue what significant means, but more than 20-30% larger will be a deal breaker. (I think we’re already there even without the required changes or dependent array sizes.) From the beta forward, format changes must not require reimports.

  2. VariantDataset must be fully generic. No more Genotype type. Variant and sample should be renamed row and column. (This may cause some shock, which we might need a strategy to mitigate.)

  3. The full Python interface should be largely finished. For the beta I think we can tolerate some instability in the Python API.

  4. Performance and memory requirements should not be significantly worse than 0.1. We can argue what significant means, but 10x worse is unacceptable; 2-3x might be OK, depending on how often the method is used and how expensive it was to begin with.

    This includes import_bgen/linreg. The fast path got ripped out, and some work will be needed to get back to the original level of performance.

    Porting the expression compiler over to region values will probably be necessary to meet this goal, given the functionality that’s getting pushed into the expression language right now (e.g. split_multi and filter_alleles).

  5. We should not lose functionality. Right now, I think the only thing that’s missing is linreg_burden (we just need to add a matrix group_by method).
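To illustrate what a matrix group_by would have to do for item 5, here is a rough plain-Python sketch. It is not Hail code: `group_rows_by_key`, its arguments, and the toy data are all made up for illustration. The idea is that rows (variants) sharing a key (say, a gene) are collapsed into one row per key by aggregating each column (sample) separately, which is the per-group score a burden test would then regress on.

```python
from collections import defaultdict


def group_rows_by_key(matrix, row_keys, combine=sum):
    """Collapse rows that share a key into one aggregated row per key.

    matrix   -- list of rows, each a list with one entry per column
    row_keys -- one grouping key per row (hypothetical, e.g. a gene name)
    combine  -- aggregator applied to each column within a group
    """
    if len(matrix) != len(row_keys):
        raise ValueError("need exactly one key per row")

    # Bucket the rows by their key.
    groups = defaultdict(list)
    for key, row in zip(row_keys, matrix):
        groups[key].append(row)

    # zip(*rows) walks the grouped rows column by column.
    return {
        key: [combine(column) for column in zip(*rows)]
        for key, rows in groups.items()
    }


# Toy data: three variants, three samples, entries are alt-allele counts.
calls = [
    [0, 1, 2],
    [1, 0, 0],
    [2, 2, 0],
]
genes = ["A", "A", "B"]
burden = group_rows_by_key(calls, genes)
```

The real method would presumably need to work on the distributed representation and take expressions for the key and the aggregation, which this sketch ignores.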

This is just for the beta. If the 0.2 release doesn’t clearly dominate 0.1 in both functionality and performance, we’ve done something very wrong.
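To make the size and performance limits in items 1 and 4 a bit more concrete, a regression check could look something like the sketch below. The function name and the exact cutoffs are assumptions, not anything agreed on; the numbers are one reading of "20-30% larger" and "2-3x vs. 10x", and the middle band is deliberately left as a judgment call.

```python
def classify_regression(old, new, ok_ratio, fail_ratio):
    """Compare a 0.2 measurement against its 0.1 baseline.

    ok_ratio and fail_ratio are placeholder limits: at or below ok_ratio
    is fine, at or above fail_ratio is a blocker, and anything in between
    needs a case-by-case decision.
    """
    if old <= 0:
        raise ValueError("baseline must be positive")
    ratio = new / old
    if ratio <= ok_ratio:
        return "ok"
    if ratio >= fail_ratio:
        return "unacceptable"
    return "needs discussion"


# File size (item 1): assumed cutoffs of 20% and 30% growth.
size_verdict = classify_regression(100, 115, ok_ratio=1.2, fail_ratio=1.3)

# Runtime (item 4): assumed cutoffs of 2x and 10x slowdown.
time_verdict = classify_regression(60.0, 300.0, ok_ratio=2.0, fail_ratio=10.0)
```

Something like this could run over a fixed set of benchmark datasets so the argument about what "significant" means happens once, over the thresholds, rather than per method.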

Have a natural way to read/write a VariantDataset as Parquet.