Fill missing dates within groups in PostgreSQL

by topcat   Last Updated July 11, 2019 21:06 PM

I have the following table with values for different stations from 2014-01-01 to 2014-01-04. The data has some date gaps that I want to fill: every station should get a row for each missing date, with the value left as NULL.

This is my table:

CREATE TABLE stations (station_id text, value integer, date date);
INSERT INTO stations (station_id, value, date) VALUES
('001', 10, '2014-01-01'),
('001', 30, '2014-01-03'),
('002', 40, '2014-01-01'),
('002', 50, '2014-01-02'),
('003', 20, '2014-01-01'),
('003', 10, '2014-01-02'),
('003', 70, '2014-01-04');

And I want something like this:

| station | value | date       |
|---------|-------|------------|
| 001     | 10    | 2014-01-01 |
| 001     | NULL  | 2014-01-02 |
| 001     | 30    | 2014-01-03 |
| 001     | NULL  | 2014-01-04 |
| 002     | 40    | 2014-01-01 |
| 002     | 50    | 2014-01-02 |
| 002     | NULL  | 2014-01-03 |
| 002     | NULL  | 2014-01-04 |
| 003     | 20    | 2014-01-01 |
| 003     | 10    | 2014-01-02 |
| 003     | NULL  | 2014-01-03 |
| 003     | 70    | 2014-01-04 |

Following some DBA Stack Exchange questions, I tried a combination of a LEFT JOIN with a LATERAL JOIN:

with complete_dates_station as (
    select station_id,
           generate_series(date '2014-01-01', date '2014-12-31', interval '1 day')::date as dt
    from stations
    group by station_id
    ), temp_join as (
        select station_id,
               dt,
               s.value
        from complete_dates_station
            left join lateral (
                select s.value
                from stations s
                where s.station_id = complete_dates_station.station_id
                and s.date = complete_dates_station.dt
                order by s.station_id, date desc
                limit 1) as s on true
             order by station_id, dt
         ) select * from temp_join

This works like a charm, but the join is really slow on my complete table, which has more than 2M rows and a date range spanning 18 years (I stopped it after 4 hours of running). I tried a simpler approach using a regular LEFT JOIN, but in the output the station comes back as NULL in every row that has no match:

with complete_dates_station as (
    select station_id,
           generate_series(date '2014-01-01', date '2014-12-31', interval '1 day')::date as dt
    from stations
    group by station_id)
select s.station_id,
       c.dt,
       s.value
from complete_dates_station c
    left outer join stations s
    on c.station_id = s.station_id and
    c.dt = s.date;

which yields the following:

| station | value | date       |
|---------|-------|------------|
| 001     | 10    | 2014-01-01 |
| NULL    | NULL  | 2014-01-02 |
| 001     | 30    | 2014-01-03 |
| NULL    | NULL  | 2014-01-04 |
| 002     | 40    | 2014-01-01 |
| 002     | 50    | 2014-01-02 |
| NULL    | NULL  | 2014-01-03 |
| NULL    | NULL  | 2014-01-04 |
| 003     | 20    | 2014-01-01 |
| 003     | 10    | 2014-01-02 |
| NULL    | NULL  | 2014-01-03 |
| 003     | 70    | 2014-01-04 |

Is there any way to optimize the first query, or a simpler approach that fills the station gaps correctly using the second query?
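
One possible direction, offered as a sketch rather than a tested fix: in the second query the station column is read from the joined table (s.station_id), which is NULL whenever no row matches, so taking station_id and the date from the generated grid instead should keep the station on every row. The version below assumes the four-day range of the sample data and builds the grid with a CROSS JOIN. The aliases st, d and dt and the index name stations_station_date_idx are made up for illustration.

```sql
-- Optional and an assumption: an index on the join columns may help on a
-- large table. The index name is hypothetical.
CREATE INDEX IF NOT EXISTS stations_station_date_idx
    ON stations (station_id, date);

-- Build one row per (station, day), then LEFT JOIN the measurements onto it.
-- station_id and the date come from the grid, so they are never NULL;
-- only value is NULL where a station has no row for that day.
SELECT st.station_id,
       s.value,
       d.dt::date AS date
FROM (SELECT DISTINCT station_id FROM stations) AS st
CROSS JOIN generate_series(timestamp '2014-01-01',
                           timestamp '2014-01-04',
                           interval '1 day') AS d(dt)
LEFT JOIN stations AS s
       ON s.station_id = st.station_id
      AND s.date = d.dt::date
ORDER BY st.station_id, d.dt;
```

Whether this beats the LATERAL version on the full table depends on the plan PostgreSQL picks, so running EXPLAIN ANALYZE on a short date range first would be a reasonable check before trying all 18 years.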

Tags : postgresql

