Rendez-vous With Cassidoo #211: longuest prefix, in sql

8th Sep 2021 cassidoo sql puzzle

I've skipped quite a lot of Cassidoo's question of the week this summer, but let's get back to it, in SQL of course!

The question

This week’s question: Write a function to find the longest common prefix string in an array of strings.

Example:
$ longestPrefix(["cranberry","crawfish","crap"])
$ "cra"
$ longestPrefix(["parrot", "poodle", "fish"])
$ ""

TL;DR

Here is what I came up with. Explanations are below:

CREATE OR REPLACE FUNCTION public.longuest_prefix(words text[])
RETURNS text
LANGUAGE sql
IMMUTABLE LEAKPROOF STRICT
AS $function$
with broken(by_idx, idx) as (
  select
    string_agg(c, '') by_idx,
    idx
  from
    unnest(words) as v(text),
    lateral string_to_table(text, NULL) with ordinality as s(c, idx)
  group by idx
),
matching(matching_char, previous_matched, idx) as (
  select
    common_char as matching_char,
    bool_and(common_char is not null) over (order by idx) as previous_matched,
    idx
  from
    broken,
    lateral (
      select (regexp_match(by_idx, '^(.)\1{' || cardinality(words) - 1 || '}$'))[1]
    ) as regex_res(common_char)
),
prefix(v) as (
  select
    string_agg(matching_char, '' order by idx) filter (where previous_matched)
  from
    matching
)
select
  coalesce(
    (select prefix.v from prefix),
    case when cardinality(words) > 0 then empty else null end
  )
from
  (values('')) as _(empty)
  ;
$function$;

How?

There's certainly a lot of ways to do that. I tried to be as SQL idiomatic as possible which means dealing with tables and aggregations because that's what SQL is good at. So here is my idea:

from the input array, make a table of characters with their position in their string of origin.
aggregate chars by position
then find what is the common character for one position (if it exists)
and aggregate all these characters up until the strings differ.

Take the array, and make it a table

The easy part is to have a table of words:

=# select * from unnest(array['foo', 'bar', 'baz']);
 unnest
════════
 foo
 bar
 baz
(3 rows)

Next we want to split the words into characters. We also have a function string_to_table for that:

NOTE: string_to_table is an addition in PostgreSQL 14, not yet released at the time of writing, but you can use regexp_split_to_table to the same effect.

=# select * from string_to_table('accordeon', null);
 string_to_table
═════════════════
 a
 c
 c
 o
 r
 d
 e
 o
 n
(9 rows)

The with ordinality just adds indexes. These are our positions:

 string_to_table │ ordinality
═════════════════╪════════════
 a               │          1
 c               │          2
 c               │          3
 o               │          4
 r               │          5
 d               │          6
 e               │          7
 o               │          8
 n               │          9
(9 rows)

These indices will allow us to group by this position, and use string_agg to get each first char, each second etc.

In PostgreSQL at least, you can use the result of a set-returning function like unnest in a from clause: from my_function() as foo. This means the return value looks like a table and we cannot directly use string_to_table on the result. What we really want is to "apply string_to_table to each return row of unnest" and the construction for that is precisely a lateral join.

So putting it all together with an array of words:

select
    c,
    idx, 
    text as "Coming from" -- note: this column won't be used later, it's just to make the result table easier to read
from
    unnest(array['foo', 'bar', 'accordeon', 'baz']) as v(text),
    lateral string_to_table(text, NULL) with ordinality as s(c, idx)
order by idx;
c │ idx │ Coming from
═══╪═════╪═════════════
 f │   1 │ foo
 b │   1 │ baz
 a │   1 │ accordeon
 b │   1 │ bar
 c │   2 │ accordeon
 a │   2 │ baz
 a │   2 │ bar
 o │   2 │ foo
 z │   3 │ baz
 o │   3 │ foo
 r │   3 │ bar
 c │   3 │ accordeon
 o │   4 │ accordeon
 r │   5 │ accordeon
 d │   6 │ accordeon
 e │   7 │ accordeon
 o │   8 │ accordeon
 n │   9 │ accordeon
(18 rows)

Yes! We have a nice table in a model that hopefully suits our need.

Aggregate each word chars by position

Then the next step is to aggregate by position on the string. Why reconstructing a string? Because it'll allow us to use some regular expression to answer the next step. We just apply string_agg with the relevant group by and order by clause to the previous query.

select
    string_agg(c, '') by_idx,
    idx
from
    unnest(array['foo', 'bar', 'accordeon', 'baz']) as v(text),
    lateral string_to_table(text, NULL) with ordinality as s(c, idx)
group by idx
order by idx;
 by_idx │ idx
════════╪═════
 fbab   │   1
 oaca   │   2
 orcz   │   3
 o      │   4
 r      │   5
 d      │   6
 e      │   7
 o      │   8
 n      │   9
(9 rows)

Keep only common chars

So if our input data is ['foo1o', 'foo2o', 'foo3oo'], our new "input" table (made from the previous step) now look like:

by_idx │ idx
════════╪═════
 fff    │   1
 ooo    │   2
 ooo    │   3
 123    │   4
 ooo    │   5
 o      │   6
(6 rows)

Now we have to match lines that are a common prefix. These are lines where there is n times the same character (n being the input array length) and that every lines before matches this condition as well (to avoid getting the last 'o' in the result even though there is a non-matching character in the middle of the word).

The first condition can be checked with the following regex: ^(.)\1{2}$. It matches a character, then check it repeats 2 times. We'll substitute the 2 by cardinality(words) in the resulting function.

Speaking about regexes... have you noticed that we can solve this with one regex only? I'll leave that for the conclusion :-)

The second is a little more complex: for each line we have to check all the preceding lines. If you're a bit used to modern SQL, this is a big red arrow pointing to ... window functions \o/

So if broken is the table from next step, and words the original array, it looks like:

  select
    common_char as matching_char,
    -- this is the window function
    -- the window frame is automatically `unbounded preceding -> curent row` once you have a `order by` in the window clause
    -- so this means "do an `and` operation on `common_char is not null` of all the preceding rows including this one
    bool_and(common_char is not null) over (order by idx) as previous_matched,
    idx
  from
    broken,
    -- the lateral here is just a way to factorize the calculus of common_char, avoiding the repetition in the select
    lateral (
      select (regexp_match(by_idx, '^(.)\1{' || cardinality(words) - 1 || '}$'))[1]
    ) as regex_res(common_char)
matching_char │ previous_matched │ idx
═══════════════╪══════════════════╪═════
 f             │ t                │   1
 o             │ t                │   2
 o             │ t                │   3
 ¤             │ f                │   4
 o             │ f                │   5
 ¤             │ f                │   6

Aggregate the prefix

And then, it's standard sql aggregation:

  select
    -- note: here, you can use a classic `where` clause instead of the filter.
    string_agg(matching_char, '' order by idx) filter (where previous_matched)
  from
    matching

Polishing stuff

Are we done? Not quite!

Unit testing

First let's test that everything works correctly with pgtap:

-- activate pgtap extension
-- then run this with psql
-- or use pg_prove
select plan(11);

\i ./20210830_longuest_prefix.sql

select is(longuest_prefix(null), null);
select is(longuest_prefix(ARRAY[]::text[]), null);
select is(longuest_prefix(array['foo']), 'foo');
select is(longuest_prefix(array['foo', 'bar']), '');
select is(longuest_prefix(array['foo1', 'foo2']), 'foo');
select is(longuest_prefix(array['foo1o', 'foo2o', 'foo3oo']), 'foo');
select is(longuest_prefix(array['pandemie', 'panflute', 'panpanpan']), 'pan');
select is(longuest_prefix(array['qwerty', 'qwertyuiop']), 'qwerty');
select is(longuest_prefix(array['cranberry','crawfish','crap']), 'cra');
select is(longuest_prefix(array['parrot', 'poodle', 'fish']), '');
select is(longuest_prefix(array[
                            'Xray',
                            'Xalam',
                            'Xanthoma',
                            'Xanthophyll',
                            'Xebec',
                            'Xenium',
                            'Xenomania',
                            'Xenophobia',
                            'Xiphias',
                            'Xiphodon',
                            'Xylograph',
                            'Xylophagia',
                            'Xylophone',
                            'Xyris']),
                          'X');

We quickly notice that the fourth test: select is(longuest_prefix(array['foo', 'bar']), '') fails. It currently returns null. Null can be used where there is no meaninful answer (null input, empty array...) but I'd argue that in this case, the meaningful answer should be '', the empty string. Next section details how to do it.

Getting empty string

From the previous result, we need to either return null or the empty string and that's where we have a problem: did you notice that when there is no common prefix or when the array is empty or null, the results of the previous queries are not really NULL, but actually an empty set (no row)?

But to be able to actually do something in sql, we need at least one row in our from clause! So we at least need to generat one fake row. Here I use values('') to do so, but anything would do, as '' is also a value that could be used by itself.

I propose 2 ways. First is a cross join (or a left join) on this fake value with prefix (which has only one row at most).

select
  case when cardinality(words) > 0 and prefix.v is null then
    empty
  else
    prefix.v
  end
from
  -- trick to always get a line, even when prefix is empty which is the case whenever there is no common prefix
  -- if we don't do that, we wont even 
  (values('')) as _(empty),
  prefix;

Or another way, maybe a bit more intuitive by using a subquery instead of a join:

select
  -- if the select is null, then the value below is applied
  coalesce(
    -- the parenthesis "force" the unwrapping of the select here
    -- it works only if this query returns only one scalar value of course
    (select prefix.v from prefix),
    case when cardinality(words) > 0 then empty else null end
  )
from
  -- we still want one and only one line, but we don't use empty in all cases
  (values('')) as _(empty)
  ;

Some function configurations

In the function declaration, we use IMMUTABLE LEAKPROOF STRICT as optimisation. strict states that PostgreSQL does not need to execute the function if the input is null, it can return null directly. immutable states that for one given input, this function will always return the same value, allowing some optimizations on query plan and leakproof is an indication for security (allowing some optimisations in some cases as well).

Conclusion

Not all coding puzzles are interesting to do in SQL. Those involving a lot of pure iterations (that can't be solved by reasoning about rows and aggregations) can be cumbersome and awkward to write (even though it is still a good thing to try in my opinion). But this code challenge adapts very well in SQL providing we try to be idiomatic and make good use of unnest and string_to_table. We can then use all the power of aggregation SQL provides!

And yes, we can solve this in a ~~simpler~~ more concise way with a regular expression:

select 
    (regexp_match(s, '^(.*)[^\n]*(\n|$)(^\1.*(\n|$))*$', 'm'))[1] 
from 
    array_to_string(array['foo1o', 'foo2o', 'foo3oo'], E'\n') s;

Previous Post Next Post