I've skipped quite a lot of Cassidoo's question of the week this summer, but let's get back to it, in SQL of course!
This week’s question: Write a function to find the longest common prefix string in an array of strings.
Example:
$ longestPrefix(["cranberry","crawfish","crap"]) $ "cra" $ longestPrefix(["parrot", "poodle", "fish"]) $ ""
Here is what I came up with. Explanations are below:
CREATE OR REPLACE FUNCTION public.longuest_prefix(words text[])
RETURNS text
LANGUAGE sql
IMMUTABLE LEAKPROOF STRICT
AS $function$
with broken(by_idx, idx) as (
select
string_agg(c, '') by_idx,
idx
from
unnest(words) as v(text),
lateral string_to_table(text, NULL) with ordinality as s(c, idx)
group by idx
),
matching(matching_char, previous_matched, idx) as (
select
common_char as matching_char,
bool_and(common_char is not null) over (order by idx) as previous_matched,
idx
from
broken,
lateral (
select (regexp_match(by_idx, '^(.)\1{' || cardinality(words) - 1 || '}$'))[1]
) as regex_res(common_char)
),
prefix(v) as (
select
string_agg(matching_char, '' order by idx) filter (where previous_matched)
from
matching
)
select
coalesce(
(select prefix.v from prefix),
case when cardinality(words) > 0 then empty else null end
)
from
(values('')) as _(empty)
;
$function$;
There's certainly a lot of ways to do that. I tried to be as SQL idiomatic as possible which means dealing with tables and aggregations because that's what SQL is good at. So here is my idea:
The easy part is to have a table of words:
=# select * from unnest(array['foo', 'bar', 'baz']);
unnest
════════
foo
bar
baz
(3 rows)
Next we want to split the words into characters. We also have a function string_to_table
for that:
NOTE:
string_to_table
is an addition in PostgreSQL 14, not yet released at the time of writing, but you can useregexp_split_to_table
to the same effect.
=# select * from string_to_table('accordeon', null);
string_to_table
═════════════════
a
c
c
o
r
d
e
o
n
(9 rows)
The with ordinality
just adds indexes. These are our positions:
string_to_table │ ordinality
═════════════════╪════════════
a │ 1
c │ 2
c │ 3
o │ 4
r │ 5
d │ 6
e │ 7
o │ 8
n │ 9
(9 rows)
These indices will allow us to group by this position, and use string_agg
to get each first char, each second etc.
In PostgreSQL at least, you can use the result of a set-returning function like unnest
in a from clause: from my_function() as foo
. This means the return value looks like a table and we cannot directly use string_to_table
on the result. What we really want is to "apply string_to_table to each return row of unnest" and the construction for that is precisely a lateral
join.
So putting it all together with an array of words:
select
c,
idx,
text as "Coming from" -- note: this column won't be used later, it's just to make the result table easier to read
from
unnest(array['foo', 'bar', 'accordeon', 'baz']) as v(text),
lateral string_to_table(text, NULL) with ordinality as s(c, idx)
order by idx;
c │ idx │ Coming from
═══╪═════╪═════════════
f │ 1 │ foo
b │ 1 │ baz
a │ 1 │ accordeon
b │ 1 │ bar
c │ 2 │ accordeon
a │ 2 │ baz
a │ 2 │ bar
o │ 2 │ foo
z │ 3 │ baz
o │ 3 │ foo
r │ 3 │ bar
c │ 3 │ accordeon
o │ 4 │ accordeon
r │ 5 │ accordeon
d │ 6 │ accordeon
e │ 7 │ accordeon
o │ 8 │ accordeon
n │ 9 │ accordeon
(18 rows)
Yes! We have a nice table in a model that hopefully suits our need.
Then the next step is to aggregate by position on the string. Why reconstructing a string? Because it'll allow us to use some regular expression to answer the next step.
We just apply string_agg
with the relevant group by
and order by
clause to the previous query.
select
string_agg(c, '') by_idx,
idx
from
unnest(array['foo', 'bar', 'accordeon', 'baz']) as v(text),
lateral string_to_table(text, NULL) with ordinality as s(c, idx)
group by idx
order by idx;
by_idx │ idx
════════╪═════
fbab │ 1
oaca │ 2
orcz │ 3
o │ 4
r │ 5
d │ 6
e │ 7
o │ 8
n │ 9
(9 rows)
So if our input data is ['foo1o', 'foo2o', 'foo3oo']
, our new "input" table (made from the previous step) now look like:
by_idx │ idx
════════╪═════
fff │ 1
ooo │ 2
ooo │ 3
123 │ 4
ooo │ 5
o │ 6
(6 rows)
Now we have to match lines that are a common prefix. These are lines where there is n times the same character (n being the input array length) and that every lines before matches this condition as well (to avoid getting the last 'o' in the result even though there is a non-matching character in the middle of the word).
The first condition can be checked with the following regex: ^(.)\1{2}$
. It matches a character, then check it repeats 2 times. We'll substitute the 2 by cardinality(words)
in the resulting function.
Speaking about regexes... have you noticed that we can solve this with one regex only? I'll leave that for the conclusion :-)
The second is a little more complex: for each line we have to check all the preceding lines. If you're a bit used to modern SQL, this is a big red arrow pointing to ... window functions \o/
So if broken
is the table from next step, and words
the original array, it looks like:
select
common_char as matching_char,
-- this is the window function
-- the window frame is automatically `unbounded preceding -> curent row` once you have a `order by` in the window clause
-- so this means "do an `and` operation on `common_char is not null` of all the preceding rows including this one
bool_and(common_char is not null) over (order by idx) as previous_matched,
idx
from
broken,
-- the lateral here is just a way to factorize the calculus of common_char, avoiding the repetition in the select
lateral (
select (regexp_match(by_idx, '^(.)\1{' || cardinality(words) - 1 || '}$'))[1]
) as regex_res(common_char)
matching_char │ previous_matched │ idx
═══════════════╪══════════════════╪═════
f │ t │ 1
o │ t │ 2
o │ t │ 3
¤ │ f │ 4
o │ f │ 5
¤ │ f │ 6
And then, it's standard sql aggregation:
select
-- note: here, you can use a classic `where` clause instead of the filter.
string_agg(matching_char, '' order by idx) filter (where previous_matched)
from
matching
Are we done? Not quite!
First let's test that everything works correctly with pgtap:
-- activate pgtap extension
-- then run this with psql
-- or use pg_prove
select plan(11);
\i ./20210830_longuest_prefix.sql
select is(longuest_prefix(null), null);
select is(longuest_prefix(ARRAY[]::text[]), null);
select is(longuest_prefix(array['foo']), 'foo');
select is(longuest_prefix(array['foo', 'bar']), '');
select is(longuest_prefix(array['foo1', 'foo2']), 'foo');
select is(longuest_prefix(array['foo1o', 'foo2o', 'foo3oo']), 'foo');
select is(longuest_prefix(array['pandemie', 'panflute', 'panpanpan']), 'pan');
select is(longuest_prefix(array['qwerty', 'qwertyuiop']), 'qwerty');
select is(longuest_prefix(array['cranberry','crawfish','crap']), 'cra');
select is(longuest_prefix(array['parrot', 'poodle', 'fish']), '');
select is(longuest_prefix(array[
'Xray',
'Xalam',
'Xanthoma',
'Xanthophyll',
'Xebec',
'Xenium',
'Xenomania',
'Xenophobia',
'Xiphias',
'Xiphodon',
'Xylograph',
'Xylophagia',
'Xylophone',
'Xyris']),
'X');
We quickly notice that the fourth test: select is(longuest_prefix(array['foo', 'bar']), '')
fails. It currently returns null
. Null can be used where there is no meaninful answer (null input, empty array...) but I'd argue that in this case, the meaningful answer should be '', the empty string. Next section details how to do it.
From the previous result, we need to either return null or the empty string and that's where we have a problem: did you notice that when there is no common prefix or when the array is empty or null, the results of the previous queries are not really NULL
, but actually an empty set (no row)?
But to be able to actually do something in sql, we need at least one row in our from
clause! So we at least need to generat one fake row. Here I use values('')
to do so, but anything would do, as '' is also a value that could be used by itself.
I propose 2 ways. First is a cross join (or a left join) on this fake value with prefix (which has only one row at most).
select
case when cardinality(words) > 0 and prefix.v is null then
empty
else
prefix.v
end
from
-- trick to always get a line, even when prefix is empty which is the case whenever there is no common prefix
-- if we don't do that, we wont even
(values('')) as _(empty),
prefix;
Or another way, maybe a bit more intuitive by using a subquery instead of a join:
select
-- if the select is null, then the value below is applied
coalesce(
-- the parenthesis "force" the unwrapping of the select here
-- it works only if this query returns only one scalar value of course
(select prefix.v from prefix),
case when cardinality(words) > 0 then empty else null end
)
from
-- we still want one and only one line, but we don't use empty in all cases
(values('')) as _(empty)
;
In the function declaration, we use IMMUTABLE LEAKPROOF STRICT
as optimisation. strict
states that PostgreSQL does not need to execute the function if the input is null, it can return null directly. immutable
states that for one given input, this function will always return the same value, allowing some optimizations on query plan and leakproof
is an indication for security (allowing some optimisations in some cases as well).
Not all coding puzzles are interesting to do in SQL. Those involving a lot of pure iterations (that can't be solved by reasoning about rows and aggregations) can be cumbersome and awkward to write (even though it is still a good thing to try in my opinion). But this code challenge adapts very well in SQL providing we try to be idiomatic and make good use of unnest
and string_to_table
. We can then use all the power of aggregation SQL provides!
And yes, we can solve this in a simpler more concise way with a regular expression:
select
(regexp_match(s, '^(.*)[^\n]*(\n|$)(^\1.*(\n|$))*$', 'm'))[1]
from
array_to_string(array['foo1o', 'foo2o', 'foo3oo'], E'\n') s;