Pipeline: implicit attribute creation


cc: @cseed @dking (new users can only cc 2 others…)

Currently in pipeline:

p = self.pipeline()
t = p.new_task()
t.command(f'echo "0" >> {t.ofile}')

Here, t.ofile doesn’t exist before use, implicitly created during access:

def _get_resource(self, item):
        if item not in self._resources:
            r = self._pipeline._new_task_resource_file(self)
            self._resources[item] = r
            self._resources_inverse[r] = item

        return self._resources[item]

    def __getitem__(self, item):
        return self._get_resource(item)

    def __getattr__(self, item):
        return self._get_resource(item)

This breaks the semantic promise of a getter, and provides 2 different ways of adding attributes to a Task, one explicit in declare_resource_group, and one implicit. It requires us to maintain 2 very similar methods, with subtle differences:

def declare_resource_group(self, **mappings):
        for name, d in mappings.items():
            assert name not in self._resources
            if not isinstance(d, dict):
                raise ValueError(f"value for name '{name}' is not a dict. Found '{type(d)}' instead.")
            rg = self._pipeline._new_resource_group(self, d)
            self._resources[name] = rg
        return self

 def _new_resource_group(self, source, mappings):
        assert isinstance(mappings, dict)
        root = self._tmp_file()
        d = {}
        new_resource_map = {}
        for name, code in mappings.items():
            if not isinstance(code, str):
                raise ValueError(f"value for name '{name}' is not a string. Found '{type(code)}' instead.")
            r = self._new_task_resource_file(source=source, value=eval(f'f"""{code}"""'))  # pylint: disable=W0123
            d[name] = r
            new_resource_map[r._uid] = r

        rg = ResourceGroup(source, root, **d)
        self._resource_map.update({rg._uid: rg})
        return rg

These differences appear when we run a command:

def handler(match_obj):
            groups = match_obj.groupdict()
            if groups['TASK']:
                raise ValueError(f"found a reference to a Task object in command '{command}'.")
            elif groups['PIPELINE']:
                raise ValueError(f"found a reference to a Pipeline object in command '{command}'.")
                assert groups['RESOURCE_FILE'] or groups['RESOURCE_GROUP']
                r_uid = match_obj.group()
                r = self._pipeline._resource_map.get(r_uid)
                if r is None:
                    raise KeyError(f"undefined resource '{r_uid}' in command '{command}'.\n"
                                   f"Hint: resources must be from the same pipeline as the current task.")
                if r._source != self:
                    #... Here we check if in r._source._mentioned
                    # But allow implicit creation when r._source == self

In the r._source != self case we don’t allow implicit creation, but do in the r._source == self case. This kind of polymorphism is clever, and I suspect error prone.

Ref: https://github.com/hail-is/hail/pull/5455#discussion_r260869446


For reference, here’s the PR where we discussed some alternatives: https://github.com/hail-is/hail/pull/4937

We had considered at one point having something like what’s below, but felt it was too redundant especially since we wanted all files to be temporary files unless otherwise specified:

p = Pipeline()
t = p.new_task()
    .command("cat {{ifile}} > {{tmp1}}; cat {{tmp1}} > {{ofile}}"))

Instead of using f-strings, I initially used the Jinja template engine to parse the variable names and substitute using the environment explicitly defined by the user

Given our design goals, we decided it was ok to have the implicit assignment of resources by overwriting __getattr__ for the benefit of having a simple, concise interface as long as we had enough error checking to detect improper uses of __getattr__.

The reason why the declare_resource_group is needed is for the case where the files share a common root and should be treated as one file group (ex: PLINK files with .bed, .bim, and .fam). It would have been ugly to do something like this fplink --bfile ... --make-bed --out {t.ofile.resource_group(bed=..., bim=..., fam=...)}.

I agree the implicit creation of objects is dangerous, but I couldn’t come up with a better way to have the interface we wanted given the limitations of Python (no dynamic scoping) and no i-strings.

I’m definitely open to hearing your suggestions for how to get around these constraints while maintaining something as close to a bash script as possible!


What about

p = Pipeline()
t = p.new_task()
t.ofile = t.new_file() #defaults to tmp file

t.command(f'echo "0" >> {t.ofile}')



I don’t think any user will stomach explicitly initializing the temporary files. I think this:

p = self.pipeline()
t = p.new_task()
t.command(f'echo "0" >> {t.ofile}')

is as much verbosity as our pipeline users will tolerate. Honestly, for very small tasks

p = self.pipeline()
p.shell_task(f'echo "0" >> {p.ofile}')

seems preferable, if we can make it work.

It certainly opens the possibility for goofs like t.ofile and t.offle, but these users are currently working in shell environments where they write echo "0" >> foo and then cat < foo. Attempts to enforce on bioinformaticians the kind of structures that we, as software engineers, accept has, so far, not captured large audiences. :woman_shrugging: we gotta play to our audience.


Something I think we should keep in mind is that Hail users often start by developing scripts interactively, but do need a system that is robust to make more productionized when those analyses look good. If we can make sure that we have extremely clear error messages and a good debugging process with implicit file creation, then I’m happy.

We should also be ready for people to start using python to generate the command strings inside these tasks, and think about how that process will go.