RFE: .so filters

Fri Jan 10 16:57:02 CET 2014

On Fri, Jan 10, 2014 at 10:06 AM, John Keeping <john at keeping.me.uk> wrote:
>
> This seems drastically over complicated.

So here's the situation. Processes that terminate give us a lot of
implicit "state" for free, and a persistent filter has to replicate
all of it:

  *a* Sending arguments to the program, and distinguishing these
arguments from data [via argv in main]
  *b* When we are finished sending data to the filter [via a closed
file descriptor]
  *c* When the filter is finished sending data to cgit [via the filter
process terminating / waitpid]
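
To make the three points concrete, here is a minimal sketch of the
one-shot model in Python rather than cgit's actual C code. The name
run_one_shot_filter and the FILTER command are hypothetical stand-ins
for illustration, not anything cgit defines:

```python
import subprocess
import sys


def run_one_shot_filter(argv, data):
    """Run a filter once and return (output, exit status)."""
    # (a) Arguments travel in argv, so they cannot be confused with data.
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    # (b) communicate() writes the data and then closes the filter's
    #     stdin; the resulting EOF tells the filter we are done sending.
    out, _ = proc.communicate(data)
    # (c) communicate() also waits for the process to exit, which is
    #     how we know the filter has finished sending its output.
    return out, proc.returncode


# Hypothetical stand-in filter: prefixes its argument, upcases stdin.
FILTER = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.argv[1] + ':' + sys.stdin.read().upper())",
    "tag",
]
```

A persistent filter loses both (b) and (c), because the pipe never
closes and the process never exits between requests.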

If we skimp on any one of these requirements, we introduce either
limited functionality or race conditions. To fully replicate these
required state transitions, we must either:

  *1* Use an out of band messaging mechanism, such as unix signals
(what I've implemented in jd/longfilters, for example)
  *2* Use two file descriptors (which then would require the filter to
select() or similar)
  *3* Come up with an encoding scheme that would separate these
messages from the data (which would then require the client to know
about it)

I don't really like any of these possibilities. I've implemented *1*
already, and while it works, it's a hassle to implement the signal
handling without races in the filter because of the *b* requirement
above. *2* is even harder to implement in simple scripts, so that's
out. And *3* is a full blown disaster, which would be so invasive that
we might as well use shared libraries if we're going to use this. So
that's out.

What all of this points to is the fact that persistent filters are not
going to wind up being a general thing available for all filter types.
I'm going to implement them specifically for email filters, with a
domain-specific encoding scheme:

  * the filter receives the email address on one line
  * the filter receives the data to filter on the next line
  * the filter then spits out its filtered data on a single line

This specificity is obviously unsuitable for any multiline filtering
or filtering of binary data. But it is simple enough to implement in
scripts that I'm fine with it.
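
As a rough sketch of what such a script could look like, assuming the
three-line exchange described above: the filter_email() transformation
is made up for illustration, and the real filter would do whatever it
likes with the two input lines.

```python
import sys


def filter_email(address, text):
    # Hypothetical transformation, purely for illustration.
    return "%s [%s]" % (text, address.lower())


def serve(infile, outfile):
    """Answer requests until the input side is closed."""
    while True:
        address = infile.readline()   # line 1: the email address
        text = infile.readline()      # line 2: the data to filter
        if not address or not text:
            break                     # EOF: the pipe was closed, so exit
        reply = filter_email(address.rstrip("\n"), text.rstrip("\n"))
        outfile.write(reply + "\n")   # exactly one line of output
        # Flush after every reply; otherwise the reader blocks forever
        # waiting for a line still sitting in our buffer.
        outfile.flush()


if __name__ == "__main__":
    serve(sys.stdin, sys.stdout)
```

The newline is the only framing here, which is exactly why this
scheme cannot carry multiline or binary data.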

It will require these changes:

  *a* Allowing persistent filter processes, with proper start-up /
tear-down times and pipe preservation (already implemented in
jd/longfilters)
  *b* Not dup2()ing the pipe to stdin/stdout, so that the filter close
function can read from the pipe itself, and block until it receives
its output (which is a bit of a different way of doing things from how
we're doing it now)
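
The calling side of those two changes might look roughly like the
following. This is a Python analogue of the idea, not the planned C
implementation; PersistentFilter, DEMO_CHILD and DEMO_FILTER are
hypothetical names, and the child script is only a stand-in filter.

```python
import subprocess
import sys

# Hypothetical stand-in filter speaking the line protocol.
DEMO_CHILD = (
    "import sys\n"
    "while True:\n"
    "    a = sys.stdin.readline()\n"
    "    t = sys.stdin.readline()\n"
    "    if not a or not t:\n"
    "        break\n"
    "    sys.stdout.write(t.strip() + ' <' + a.strip() + '>\\n')\n"
    "    sys.stdout.flush()\n"
)
DEMO_FILTER = [sys.executable, "-c", DEMO_CHILD]


class PersistentFilter:
    def __init__(self, argv):
        # Start the filter once and keep both pipe ends for ourselves,
        # instead of redirecting our own stdin/stdout to them.
        self.proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE, text=True)

    def filter(self, address, text):
        # Send one request: the address line, then the data line.
        self.proc.stdin.write(address + "\n" + text + "\n")
        self.proc.stdin.flush()
        # Block on the pipe until the single reply line arrives.
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Tear-down: closing stdin gives the filter EOF; then reap it.
        self.proc.stdin.close()
        status = self.proc.wait()
        self.proc.stdout.close()
        return status
```

Each call to filter() reuses the same process, so the start-up cost is
paid once rather than per email address.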

I'm not too pumped about *b*, but that's the only way unless we're to
use signals or some other OOB mechanism. I'll code this up and report
back.