Two Database Architecture : Operational and Historical

by Gudradain   Last Updated October 09, 2019 19:05 PM

I thought about an uncommon database structure and wonder if anyone has seen it in use before. It's basically using 2 databases:

  • First database hold only the data that is currently valid
  • Second database hold the history of everything that has ever been entered, updated or deleted in the first database

Scenario

I'm working on a project where I'm required to log everything that happens and where the data changes frequently.

Example (not the real one)

You have to do the database design for a soccer league. In this league there are players and teams. The players often switch teams.

  • First requirement : The database must hold the information necessary to play the next match. This means a list of every players, teams and in which team each player is currently.
  • Second requirement : The database must hold historical values that we will use to generate statistics. This means the list of all the players that have been part of a team or the list of all the teams a player has been a part of.

The problem

These two requirements are kinda the opposite of each other. I have tried to do everything in the same database but it doesn't make sense. The first requirement only cares about "playing the next match" while the second requirement only cares about "generating statistics".

To do everything in the same database, I went with a sort of "insert only" database using the obvious soft delete to delete/update information...

What initially seemed like an easy task, holding a list of players, teams and the current team of each players, suddenly becomes a lot harder. The application logic required to play the next match is already complicated enough but now the database has a very unhelpful design where the application is required to add "is deleted" check on every single query just to play the next match.

Would you want to be that coach that yell "all players in the team, come to me" and then 2000 players come at you. At which point, you will probably yell "all players that are not deleted in the team, come to me" (while swearing about this stupid design).

My conclusion

I came to wonder why you need to put everything in the same database. Not only does the soft delete do a poor job at logging everything unless you add many columns (time_created, who_created_it, time_deleted, who_deleted_it) but it also complicate everything. It complicates the database design and it complicates the application design.

Also, I receive these 2 requirements as part of one single application that cannot be split but I keep thinking : this is 2 completely distinct applications. Why am I trying to do everything together.

That's when I thought about splitting the database in two. An operational database that is used only to play the next match and only contain the information that is currently valid and an historical database that hold all the information that ever existed, when it was created, deleted and who did it.

The goal is to keep the first database (operational) and the application as simple as possible while having as much information as possible in the second database (historical).

Questions

  • Have you seen that design before? Does it have name?
  • Are there any obvious pitfalls that I am missing?



EDIT 2015-03-16

Current architecture

You can basically think about the whole architecture as a 2 steps process.

Step 1 :

  • Application is running and users are doing some actions
  • Each time an event happen, it is recorded automatically (audit solution) in an event table
  • Then the correct row, in the operational database is updated

Step 2 :

  • A job read the latest insertion in the event table and insert this new data in the historical database.
  • Users query the historical database to retrieve the information that they need.

Just from the event table, you can reconstruct the information to any point in time. The problem is that this event table is not easily queryable. This is where the historical database kicks in; to present the data into a way that it is easy to retrieve exactly what we want.

Additional problems when putting everything in the same tables

I have already expressed my concern about the added complexity of checking "is deleted" on each query. But there is another issue : integrity.

I make heavy use of foreign key and constraint to make sure that at any point in time, the data that is in my database is valid.

Let's look at an example :

Constraint : There can only be one goal keeper per team.

It's easy to add a unique index that check if there is only one goal keeper per team. But then what happen when you change the goal keeper. You still need to preserve the information about the previous one but now you have 2 goal keepers in the same teams, one active and one inactive, which contradicts your constraint.

Sure it's easy to add a check to your constraint, but it's another thing to manage and think about.

Tags : design database


Answers 5


This is actually similar to the way database transactions are generally implemented, except the historical data is usually thrown away after being written to the operational database. The closest programming pattern I can think of is event sourcing.

I think splitting these two databases up is the right move. More specifically, I would look at the "operational" database as a cache, since the historical data will be sufficient to re-construct the operational database at any time. Depending on the nature of your application and performance requirements, it may be unnecessary to maintain this cache as a separate database if it is reasonable to reconstruct the current state from the historical data in memory each time the program is started.

As far as pitfalls, the main issue you could run into is if you need any kind of concurrency (either in the same program or by having multiple clients use the database at the same time). In that case you would want to make sure that the modifications to the historical and operational databases is done atomically. In the case of concurrency within the same program, your best bet is probably some sort of locking mechanism. For multiple clients interacting with the same database, the easiest way would be to keep both as tables in the same database, and use transactions to keep the database consistent.

John Colanduoni
John Colanduoni
March 16, 2015 07:44 AM

First things first, does your codebase already offer a clean separation of concerns where the business logic (of picking players to play in the next match) is distinguished from your database access logic (a layer that simply connects to the database and maps your data structures into database rows and vice versa)? The answer for this will go a great deal in explaining why you are dealing with this:

It complicates the database design and it complicates the application design.

Now...

The application logic required to play the next match is already complicated enough but now the database has a very unhelpful design where the application is required to add "is deleted" check on every single query just to play the next match.

Assuming you are talking about RDBMS, you can still have a bitemporal database that captures all valid data past, present and possibly future, and then use a robust-enough database access library/ORM framework to handle the database querying logic for you. You may even use a database view to aid in your selection. Then, the business logic parts of your code should not need to know the underlying temporal fields, which will eliminate the problem you described above.

For example, instead of having to hard-code a SQL query in your application:

SELECT * FROM team_players WHERE team_id = ? AND valid_from >= ? AND valid_to <= ? AND ...

(using ? as parameter bindings)

A hypothetical database access library might let you query as (pseudo-code):

dbConnection.select(Table.TEAM_PLAYERS).match(Field.TEAM_ID, <value>).during(Season.NOW)

Or using the database view approach:

-- Using MySQL dialect for illustration
CREATE VIEW seasonal_players AS SELECT * FROM team_players 
    WHERE valid_from >= ... AND valid_to <= ... AND ...

You can then query for players during that period as:

SELECT * FROM seasonal_players WHERE team_id = ?

Back to your suggested database model, how do you intend to handle the removal of historical data from your operational database? Do you simply INSERT two similar rows into your operational and historical databases each, and then run a DELETE on the operational database whenever there is a user action to delete historical data?


Think Big

If you're talking about large-scale data processing where your 'database' is a scaled-out distributed database-clustering/stream-processing solution, your approach will sound vaguely similar (probably just on some of the defined terms) to the Lambda Architecture, in which your 'historical' (i.e. past-real-time) data is batch processed separately to perform the kind of statistics you are looking for, and your 'operational' (i.e. streaming real-time) data remains query-able for a predefined limit before the batch processing persists them. However, the basis of this approach is driven more by the advantages and limitations of current Big Data implementations, than mere simplification of one's application logic.


edit (after OP's edit)

I should have replied to this earlier, but anyways:

Also, I receive these 2 requirements as part of one single application that cannot be split but I keep thinking : this is 2 completely distinct applications. Why am I trying to do everything together.

That's generally because end-users tend to think in terms of features, not number of databases required.

You also mention that:

Step 1:

  • Each time an event happen, it is recorded automatically (audit solution) in an event table.
  • Then the correct row, in the operational database is updated.

Step 2:

  • A job reads the latest insertion in the event table and insert this new data in the historical database.

Great! So now you have an event table, which I presume really is the event sourcing concept mentioned by the other answers. However, what is it that reads your event table and update the correct row in the operational database? Is that the same as the job that reads the last event and inserts it into the historical database?

One more point regarding your constraint example:

I make heavy use of foreign key and constraint to make sure that at any point in time, the data that is in my database is valid.

Let's look at an example :

Constraint : There can only be one goal keeper per team. (active on the pitch during a game, just a side-note)

Is "any point in time" referring to valid time or transaction time?

What happens when we have a third new goalkeeper? Do you create an unique constraint on temporal fields in the historical database to keep the data for two "old" goalkeepers valid?

h.j.k.
h.j.k.
March 16, 2015 09:13 AM

It happens quite often, though the history (sometimes known as audit records) is kept either in the same table, or in a separate one in the same database.

For example, I used to work with a system where any updates to a table would be implemented as an insert, the old 'current' record would have a flag set on it saying it was a historical record, and the timestamp when it was updated written to a column.

Today I work on a system where every change is written to a dedicated audit table and then the update occurs on the table.

The latter is more scalable, but not as easy to implement in a generic manner.

The easiest way to achieve your goal of making queries simple and not requiring adding the 'is current' flag is to only allow read queries via a view or stored procedure. Then you make a call to say "get all players" and the stored proc will return only current players (you can implement a 2nd procedure to return players with more control over which ones are returned). This works well for writing too. A stored procedure for updating a player can then write whatever history details are needed and update the player - without the client ever knowing what the history mechanism is. For this reason, stored procedures are better than a view which returns only current players, as it keeps the entire DB access mechanism the same for reading and writing - everything goes through a sproc.

gbjbaanb
gbjbaanb
March 16, 2015 10:50 AM

By splitting a database into two databases you will lose all the benefits of relational references and referential integrity checking. I have never tried such a thing, but my guess is that it would become a major nightmare.

I believe that the entire data set which describes a certain system belongs to a single database. Issues of convenience in accessing the data are almost never a good reason for making data organization decisions.

Issues of convenience in accessing your data should be handled by making use of whatever convenience features are offered by your RDBMS.

So, instead of having a 'current' database and a 'historical' database, you should only have one database, and all tables in it should be prefixed with 'historical'. Then, you should create a set of views, one for each table that you want to see as 'current', and have each view filter out the historical rows that you do not want to see, and let only the current ones pass through.

This is a suitable solution to your problem because it makes use of a convenience feature of the RDBMS to address a convenience concern of the programmer, leaving the database design intact.

An example of a problem you are likely to run into (too long for a comment)

Suppose you are looking at a screen which is showing you current information about a team, say team.id=10, team.name="Manchester United", and you click on the button which says "show history". At that point, you will want to switch to a screen which shows historical information about that same team. So, you are going to take the id of 10 which you know that in the "current" database stands for "Manchester United", and you are going to have to hope that this id number of 10 also stands for "Manchester United" in the historical database. There is no referential integrity rule which enforces that the id refers to the exact same entity in both databases, so, essentially, you will have two entirely disjoint data sets with implicit connections which are only known by, honored by, and promised to be maintained by, code outside of the database.

And of course this holds true not only for major tables, like the "Teams" table, but even for the tiniest little table that you will have on the side, like "Player positions: Forward, Midfielder, Goalkeeper, etc."

Achieving historicity within the same database

There are various methods of maintaining historicity, and even though they are beyond the scope of this question, which is basically what pitfalls this particular idea of yours might have, here is an idea:

You can maintain a log table containing one entry for every single change that was ever made to the database. This way, all the "current" tables can be completely cleared of data and fully reconstructed by replaying the logged changes. Of course, if you can reconstruct the "current" tables by replaying the changes from the beginning of time until now, then you can also construct a temporary set of tables in order to get a view of the database at a specific time coordinate by replaying the changes from the beginning of time until that specific time coordinate.

This is known as "Event Sourcing" (Article by Martin Fowler.)

Mike Nakis
Mike Nakis
March 16, 2015 12:23 PM

Data versioning support in databases is a well-established topic and some DBMS's support this feature. I remember reading that MariaDB supports data versioning (https://mariadb.com/resources/blog/automatic-data-versioning-in-mariadb-server-10-3/) and a quick search unearthed something called OrpheusDB (https://medium.com/data-people/painless-data-versioning-for-collaborative-data-science-90cf3a2e279d)

Ramiro Carvalho
Ramiro Carvalho
October 09, 2019 18:24 PM

Related Questions


Updated April 10, 2018 14:05 PM

Updated September 04, 2019 07:05 AM

Updated April 10, 2018 12:05 PM

Updated October 26, 2018 10:05 AM

Updated July 17, 2019 14:05 PM