TransWikia.com

How to get non-overlapping distinct intervals from a PostgreSQL table?

Database Administrators Asked on October 28, 2021

Using postgresql 9.6.

The table has user sessions and I need distinct non overlapping sessions printed.

CREATE TABLE SESSIONS(
            id serial NOT NULL PRIMARY KEY, 
            ctn INT NOT NULL, 
            day DATE NOT NULL,
            f_time TIME(0) NOT NULL,
            l_time TIME(0) NOT  NULL
        );     
    INSERT INTO SESSIONS(id, ctn, day, f_time, l_time)
    VALUES
    (1, 707, '2019-06-18', '10:48:25', '10:56:17'),
    (2, 707, '2019-06-18', '10:48:33', '10:56:17'),
    (3, 707, '2019-06-18', '10:53:17', '11:00:49'),
    (4, 707, '2019-06-18', '10:54:31', '10:57:37'),
    (5, 707, '2019-06-18', '11:03:59', '11:10:39'),
    (6, 707, '2019-06-18', '11:04:41', '11:08:02'),
    (7, 707, '2019-06-18', '11:11:04', '11:19:39');

sql fiddle

id  ctn day         f_time      l_time
1   707 2019-06-18  10:48:25    10:56:17
2   707 2019-06-18  10:48:33    10:56:17
3   707 2019-06-18  10:53:17    11:00:49
4   707 2019-06-18  10:54:31    10:57:37
5   707 2019-06-18  11:03:59    11:10:39
6   707 2019-06-18  11:04:41    11:08:02
7   707 2019-06-18  11:11:04    11:19:39

Now I need distinct non-overlapping user sessions, so it should give me:

1.  start_time: 10:48:25  end_time: 11:00:49  duration: 12min,24 sec
2.  start_time: 11:03:59  end_time: 11:10:39  duration: 6min,40 sec
3.  start_time: 11:11:04  end_time: 11:19:39  duration: 8min,35 sec

2 Answers

I used the accepted answer for my own needs until I found a limit:

What if we have these rows ?

( 9, 707, '2019-06-18', '12:15:15', '13:13:13'),
(10, 707, '2019-06-18', '13:04:41', '13:10:02'), 
(11, 707, '2019-06-18', '13:12:17', '13:22:22'),

Both queries create a new group starting on row 10 instead of grouping these 3 rows together.

 Interval No. |               | Interval start |             | Interval stop |             | Duration
--------------+---------------+----------------+-------------+---------------+-------------+----------
            1 |  Start time:  | 10:48:25       |  End time:  | 11:00:49      |  Duration:  | 00:12:24
            2 |  Start time:  | 11:03:59       |  End time:  | 11:10:39      |  Duration:  | 00:06:40
            3 |  Start time:  | 11:11:04       |  End time:  | 11:19:39      |  Duration:  | 00:08:35
            4 |  Start time:  | 12:15:15       |  End time:  | 13:13:13      |  Duration:  | 00:57:58
            5 |  Start time:  | 13:12:17       |  End time:  | 13:22:22      |  Duration:  | 00:10:05
            6 |  Start time:  | 14:05:17       |  End time:  | 14:14:14      |  Duration:  | 00:08:57
Intervals 4 and 5 are overlapping.

The answer is to tweak the core query

WHEN LAG(c.lt) OVER () > ft THEN 0

needs to be replace by

WHEN max(c.lt) over (order by c.ft, c.lt rows between unbounded preceding and 1 preceding) > ft THEN 0

then you'll get:

 Interval No. |               | Interval start |             | Interval stop |             | Duration
--------------+---------------+----------------+-------------+---------------+-------------+----------
            1 |  Start time:  | 10:48:25       |  End time:  | 11:00:49      |  Duration:  | 00:12:24
            2 |  Start time:  | 11:03:59       |  End time:  | 11:10:39      |  Duration:  | 00:06:40
            3 |  Start time:  | 11:11:04       |  End time:  | 11:19:39      |  Duration:  | 00:08:35
            4 |  Start time:  | 12:15:15       |  End time:  | 13:22:22      |  Duration:  | 01:07:07
            5 |  Start time:  | 14:05:17       |  End time:  | 14:14:14      |  Duration:  | 00:08:57
(5 rows)

Answered by Gilles on October 28, 2021

To solve this problem I did the following:

"Easy" explanation:

For this part I slightly added to the table definition provided by the OP. I'm firmly of the belief that DDL should be used to the maximum extent possible to "guide" the whole database programming process and could be much more powerful - an example of this would be SQL in CHECK constraints - so far only provided by Firebird (example here) and H2 (see reference here).

However, that's all very well and good, but we have to deal with PostgreSQL's 9.6 capabilities - the OP's version. My adjusted DDL for the "simple" explanation (see the entire fiddle here):

CREATE TABLE sessions
(
        id serial NOT NULL PRIMARY KEY, 
        ctn INT NOT NULL, 
        f_day DATE NOT NULL,
        f_time TIME(0) NOT NULL,
        l_time TIME(0) NOT  NULL,
        CONSTRAINT ft_less_than_lt_ck CHECK (f_time < l_time),
        CONSTRAINT ctn_f_day_f_time_uq UNIQUE (ctn, f_day, f_time),
        CONSTRAINT ctn_f_day_l_time_uq UNIQUE (ctn, f_day, l_time)
        -- could put in a DISTINCT somewhere if you don't have these constraints
        -- maybe has TIME(2) - but see complex solution
);

Indexes:

CREATE INDEX ctn_ix ON sessions USING BTREE (ctn ASC);
CREATE INDEX f_day_ix ON sessions USING BTREE (f_day ASC);
CREATE INDEX f_time_ix ON sessions USING BTREE (f_time ASC);

Just a point to note: do not use SQL keywords as table or column names - day is such a keyword! It can be confusing for debugging &c - it's simply not good practice. I've changed your original fieldname of day to f_day - note all lower and python case! Whatever you do, have a standard method of naming variables and stick to it - there are many coding standards documents out there.

The change to 'f_day' has no effect on the rest of the SQL as we don't take account of sessions which span midnight. Taking account of these can be done relatively easily by doing to the following (see fiddle).

SELECT (f_day + f_time)::TIMESTAMP FROM sessions;

Now the advent of GENERATED columns, you don't even have to worry about this - just have a GENERATED field as above!

If a constraint to the second is unworkable - logins at the same time, you could possibly use TIME(2) (or 3..6) for ensuring uniqueness. If [you don't want | can't have] UNIQUE constraints, you could put in DISTINCT in your SQL for identical login and logout times - though this is unlikely.

The fact remains that some simple DDL like this enormously simplifies your subsequent SQL (see discussion at the end of the "complex" explanation below).

You might also want to put ctn and/or day in your DDL UNIQUE constraints as shown? I've also added what I think may be good indexes! You might also want to investigate the OVERLAPS operator?

As for the sample data, I also added a few records for testing my solution as follows:

INSERT INTO sessions (id, ctn, day, f_time, l_time)
VALUES
( 1, 707, '2019-06-18', '10:48:25', '10:56:17'), 
( 2, 707, '2019-06-18', '10:53:17', '11:00:49'),
( 3, 707, '2019-06-18', '10:54:31', '10:59:43'),  -- record 3 is completely covered 
                                                  -- by record 2

( 4, 707, '2019-06-18', '11:03:59', '11:10:39'), 
( 5, 707, '2019-06-18', '11:04:41', '11:08:02'), -- GROUP 2 record 6 completely
                                                 -- covers record 7
                                                 
( 6, 707, '2019-06-18', '11:11:04', '11:19:39'), -- GROUP 3

( 7, 707, '2019-06-18', '12:15:15', '13:13:13'),
( 8, 707, '2019-06-18', '13:04:41', '13:20:02'), 
( 9, 707, '2019-06-18', '13:17:17', '13:22:22'), -- GROUP 4

(10, 707, '2019-06-18', '14:05:17', '14:14:14'); -- GROUP 5

I'll go through my logic step by step - good for you perhaps, but also for me as it helps me to clarify my thinking and will ensure that the lessons which I have learnt from this exercise will stay with me - "I hear and I forget. I see and I remember. I do and I understand." - Confucius.

All of the following is included in the fiddle.

/**

So, the desired result is:

Interval 1 - start: 10:48:25 - end 11:00:49
Interval 2 - start: 11:03:59 - end 11:10:39 
Interval 3 - start: 11:11:04 - end 11:19:39
Interval 4 - start: 12:15:15 - end 13:22:22
Interval 5 - start: 14:05:17 - end 14:14:14

**/

The first step is to use the LAG function (documentation) as follows:

SELECT 
  s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
  CASE
    WHEN LAG(s.l_time) OVER () > f_time THEN 0
    ELSE 1
  END AS ovl
FROM sessions s

Result:

id  ctn     ft  lt  ovl
1   707     10:48:25    10:56:17    1
2   707     10:53:17    11:00:49    0
3   707     10:54:31    10:59:43    0
4   707     11:03:59    11:10:39    1
5   707     11:04:41    11:08:02    0
6   707     11:11:04    11:19:39    1
7   707     12:15:15    13:13:13    1
8   707     13:04:41    13:20:02    0
9   707     13:17:17    13:22:22    0
10  707     14:05:17    14:14:14    1

So, whenever there's a new interval, there's a 1 in the ovl (overlap) column.

Next, we take the cumulative SUM of those 1's as follows:

SELECT 
  t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
  SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                          AND CURRENT ROW) AS s
FROM
(
  SELECT 
    s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
    CASE
      WHEN LAG(s.l_time) OVER () > f_time THEN 0
      ELSE 1
    END AS ovl
  FROM sessions s
) AS t1
ORDER BY lt, id

Result:

id  ctn     ft  lt  ovl     s
1   707     10:48:25    10:56:17    1   1
3   707     10:54:31    10:59:43    0   1
2   707     10:53:17    11:00:49    0   1
5   707     11:04:41    11:08:02    0   2
4   707     11:03:59    11:10:39    1   2
6   707     11:11:04    11:19:39    1   3
7   707     12:15:15    13:13:13    1   4
8   707     13:04:41    13:20:02    0   4
9   707     13:17:17    13:22:22    0   4
10  707     14:05:17    14:14:14    1   5

So, we have now "split out", and have a way of distinguishing between, our intervals - each interval has a different value of s - 1..5.

So, now we want to get the lowest value of f_time and the highest value of l_time for these intervals. My first attempt using MAX() and MIN() went as follows:

SELECT 
  ROW_NUMBER() OVER (PARTITION BY s) AS rn,
  MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt) AS min_f, 
  MAX(lt) OVER (PARTITION BY s ORDER BY ft, lt) AS max_l,
  s
FROM
(
  SELECT 
    t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
    SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                           AND CURRENT ROW) AS s
  FROM
  (
    SELECT 
      s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
      CASE
        WHEN LAG(s.l_time) OVER () > f_time THEN 0
        ELSE 1
      END AS ovl
    FROM sessions s
  ) AS t1;
  ORDER BY id, lt
)AS t2
ORDER BY s, rn ASC, min_f;

Result:

rn  min_f   max_l   s
1   10:48:25    10:56:17    1
2   10:48:25    11:00:49    1
3   10:48:25    11:00:49    1
1   11:03:59    11:10:39    2
2   11:03:59    11:10:39    2
1   11:11:04    11:19:39    3
1   12:15:15    13:13:13    4
2   12:15:15    13:20:02    4
3   12:15:15    13:22:22    4
1   14:05:17    14:14:14    5

Note how we have to get rn = 3 for the first interval, rn = 3 for the fourth and different values of rn for different intervals - if there were 7 sub-intervals making up one interval, then we would have to retrieve rn = 7 - this stumped me for a while!

Then the power of Window functions came to the rescue - if you sort the MAX() and MIN() differently, the correct result drops into our lap:

SELECT 
  ROW_NUMBER() OVER (PARTITION BY s) AS rn,
  MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f, 
  MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
  s
FROM
(
  SELECT 
    t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
    SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                           AND CURRENT ROW) AS s
  FROM
  (
    SELECT 
      s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
      CASE
        WHEN LAG(s.l_time) OVER () > f_time THEN 0
        ELSE 1
      END AS ovl
    FROM sessions s
  ) AS t1
  ORDER BY id, lt
)AS t2
ORDER BY s, rn ASC, min_f;

Result:

rn  min_f   max_l   s
1   10:48:25    11:00:49    1
2   10:48:25    11:00:49    1
3   10:48:25    10:59:43    1
1   11:03:59    11:10:39    2
2   11:03:59    11:08:02    2
1   11:11:04    11:19:39    3
1   12:15:15    13:22:22    4
2   12:15:15    13:22:22    4
3   12:15:15    13:22:22    4
1   14:05:17    14:14:14    5

Note that now, rn = 1 is always our desired record - this is the result of:

  MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f, 
  MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,

Note that for MIN(), the sort is by lt DESC and for the MAX() (partitioned by interval - i.e. s) it is by ft DESC. This matches up the smallest ft with the largest lt which is what we want.

This is essentially our desired result - just add a bit of tidying up and formatting as per the OP's requirements and we're good to go. This part also demonstrates the use of another very useful Window function - ROW_NUMBER().

SELECT 
  ROW_NUMBER() OVER () AS "Interval No.", 
  ' Start time: ' AS " ",
  t3.min_f AS "Interval start" , 
  ' End time: ' AS " ",
  t3.max_l AS "Interval stop", 
  ' Duration: ' AS " ",
  (t3.max_l - t3.min_f) AS "Duration"
FROM
(
  SELECT 
    ROW_NUMBER() OVER (PARTITION BY s) AS rn,
    MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f, 
    MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
    s
  FROM
  (
    SELECT 
      t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
      SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                             AND CURRENT ROW) AS s
    FROM
    (
      SELECT 
        s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
        CASE
          WHEN LAG(s.l_time) OVER () > f_time THEN 0
          ELSE 1
        END AS ovl
      FROM sessions s
    ) AS t1
    ORDER BY id, lt
  )AS t2
  ORDER BY s, rn ASC, min_f
) AS t3 
WHERE t3.rn = 1;

Final result:

Interval No.        Interval start      Interval stop       Duration
1    Start time:    10:48:25     End time:  11:00:49     Duration:  00:12:24
2    Start time:    11:03:59     End time:  11:10:39     Duration:  00:06:40
3    Start time:    11:11:04     End time:  11:19:39     Duration:  00:08:35
4    Start time:    12:15:15     End time:  13:22:22     Duration:  01:07:07
5    Start time:    14:05:17     End time:  14:14:14     Duration:  00:08:57

I can make no guarantees about the performance of this query if there is a large number of records, see the result of EXPLAIN (ANALYZE, BUFFERS) at the end of the fiddle. However, I'm assuming that since it's in a report style format, it might be for a given value of ctn and/or day - i.e. not too many records?

"Complex" explanation:

I won't show every step - after eliminating the duplicate f_times and l_times, the steps are identical.

Here, the table definition and data are slightly different (fiddle available here):

CREATE TABLE sessions
(
        id serial NOT NULL PRIMARY KEY, 
        ctn INT NOT NULL, 
        f_day DATE NOT NULL,
        f_time TIME(0) NOT NULL,
        l_time TIME(0) NOT  NULL,
        CONSTRAINT ft_lt_lt CHECK (f_time < l_time),
        -- CONSTRAINT ft_uq UNIQUE (f_time),
        -- CONSTRAINT lt_uq UNIQUE (l_time)
        CONSTRAINT ft_lt_uq UNIQUE(f_time, l_time) 
        -- could put in a DISTINCT somewhere to counter this possibility or
        -- maybe have TIME(2) to ensure no duplicates?
);

The only constraints that I've kept are CHECK (f_time < l_time) (couldn't be any other way) and UNIQUE f_time, l_time (maybe add day and/or ctn to that - the advice about TIME(2) or (3...6) above also applies.

I'll leave it up to the reader to apply UNIQUE to combinations of ctn and f_day as applicable!

INSERT INTO sessions (id, ctn, day, f_time, l_time)
VALUES
( 1, 707, '2019-06-18', '10:48:25', '10:56:17'), -- note - same l_times
( 2, 707, '2019-06-18', '10:48:33', '10:56:17'), -- need one with lowest f_time
( 3, 707, '2019-06-18', '10:53:17', '11:00:49'),
( 4, 707, '2019-06-18', '10:54:31', '10:59:43'), -- note - same f_times
                                                 -- need one with greatest l_time
( 5, 707, '2019-06-18', '10:54:31', '10:57:37'), -- GROUP 1

( 6, 707, '2019-06-18', '11:03:59', '11:10:39'), 
( 7, 707, '2019-06-18', '11:04:41', '11:08:02'), -- GROUP 2, record 6 completely
                                                 -- covers record 7
( 8, 707, '2019-06-18', '11:11:04', '11:19:39'), -- GROUP 3

( 9, 707, '2019-06-18', '12:15:15', '13:13:13'),
(10, 707, '2019-06-18', '13:04:41', '13:20:02'), 
(11, 707, '2019-06-18', '13:17:17', '13:22:22'), -- GROUP 4

(12, 707, '2019-06-18', '14:05:17', '14:14:14'); -- GROUP 5

I've added a couple of potentially "troublesome" records (2 & 4) with the same f_time and l_time within the same interval. So, in the event of an identical f_time, we want the sub-interval with the greatest l_time and vice-versa for the case of an identical l_time (i.e. smallest f_time).

So, what I did in this case was to eliminate duplicates by chaining CTE's (aka the WITH clause) as follows:

WITH cte1 AS 
(
  SELECT s.*, t.mt, t.lt
  FROM sessions s
  JOIN
  (
    SELECT
      DISTINCT 
      ctn,
      MIN(f_time) AS mt,
      l_time AS lt
    FROM sessions
    GROUP BY ctn, l_time
    ORDER BY l_time
  ) AS t
  ON (s.ctn, s.f_time, s.l_time) = (t.ctn, t.mt, t.lt)
  ORDER BY s.l_time
), 
cte2 AS
(
  SELECT
    DISTINCT
    ctn,
    f_time AS ft,
    MAX(lt) AS lt
  FROM cte1
  GROUP BY ctn, f_time
  ORDER BY f_time
)
SELECT * FROM cte2
ORDER BY ft;

Result:

ctn     ft  lt
707     10:48:25    10:56:17
707     10:53:17    11:00:49
707     10:54:31    10:59:43
707     11:03:59    11:10:39
707     11:04:41    11:08:02
707     11:11:04    11:19:39
707     12:15:15    13:13:13
707     13:04:41    13:20:02
707     13:17:17    13:22:22
707     14:05:17    14:14:14

And then I treat cte2 as the starting point for the process in the "easy" explanation.

The final SQL is as follows:

WITH cte1 AS 
(
  SELECT s.*, t.mt, t.lt
  FROM sessions s
  JOIN
  (
    SELECT
      DISTINCT 
      ctn,
      MIN(f_time) AS mt,
      l_time AS lt
    FROM sessions
    GROUP BY ctn, l_time
    ORDER BY l_time
  ) AS t
  ON (s.ctn, s.f_time, s.l_time) = (t.ctn, t.mt, t.lt)
  ORDER BY s.l_time
), 
cte2 AS
(
  SELECT
    DISTINCT
    ctn,
    f_time AS ft,
    MAX(lt) AS lt
  FROM cte1
  GROUP BY ctn, f_time
  ORDER BY f_time
)
SELECT 
  ROW_NUMBER() OVER () AS "Interval No.", 
  ' Start time: ' AS " ",
  t3.min_f AS "Interval start" , 
  ' End time: ' AS " ",
  t3.max_l AS "Interval stop", 
  ' Duration: ' AS " ",
  (t3.max_l - t3.min_f) AS "Duration"
FROM
(
  SELECT 
    ROW_NUMBER() OVER (PARTITION BY s) AS rn,
    MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f, 
    MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
    s
  FROM
  (
    SELECT 
    t1.ctn, t1.ft, t1.lt, t1.ovl,
    SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                           AND CURRENT ROW) AS s
    FROM
    (
      SELECT 
        c.ctn AS ctn, c.ft AS ft, c.lt AS lt, 
        CASE
          WHEN LAG(c.lt) OVER () > ft THEN 0
          ELSE 1
        END AS ovl
      FROM cte2 c
    ) AS t1
    ORDER BY t1.lt
  ) AS t2
  ORDER BY s, rn ASC, min_f
) AS t3
WHERE t3.rn = 1
ORDER BY t3.rn;

Result:

Interval No.        Interval start      Interval stop       Duration
1    Start time:    10:48:25     End time:  11:00:49     Duration:  00:12:24
2    Start time:    11:03:59     End time:  11:08:02     Duration:  00:04:03
3    Start time:    11:11:04     End time:  11:19:39     Duration:  00:08:35
4    Start time:    12:15:15     End time:  13:22:22     Duration:  01:07:07
5    Start time:    14:05:17     End time:  14:14:14     Duration:  00:08:57

As you can see, it's pretty hairy stuff - not having the UNIQUE constraints in the DDL has doubled the length of the SQL and the time taken for the planning and executions stages, and made it pretty horrible into the bargain.

See the end of the fiddle for the plans for both queries! Lessons to be learnt there! As a rule of thumb, the longer the plan, the slower the query!

I'm not sure that the indexes can play any role here since we're selecting from the entire table and it's very small! If we were filtering a large table by ctn and/or f_day and/or f_time, I'm pretty sure that we would start to see differences in the plans (and timings!) if there were no indexes!

Answered by Vérace on October 28, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP