Current status of APDB and PPDB implementation#

Abstract

The Alert Production DataBase (APDB) and Prompt Products DataBase (PPDB) design has been evolving for the past few years as a result of various tests and updated requirements. This technical note describes the current state of APDB and PPDB implementation and the details that affect the design of the system. Concerns related to scalability of both parts of the system are summarized, and missing parts of the design are identified.

Abridged History#

APDB is responsible for storage of the DIA records produced by the Alert Production (AP) pipeline. It was realized early on that the performance of APDB would be critical for fulfilling the science requirement on Alert Production. The performance goal for alert production is a 60-second delay from the end of an exposure to the issuing of alerts for that visit. A significant fraction of those 60 seconds will naturally be consumed by the image analysis software, so interaction with APDB should ideally use only a small fraction of that time.

The AP pipeline processes data from each detector in parallel, which translates into 189 individual processes reading from and writing to APDB concurrently. Initial experiments with an APDB implementation based on SQL backends (DMTN-113 [Salnikov, 2019]) demonstrated that even with very fast SSD storage a single SQL server cannot handle this sort of load. It was also clear that to accommodate larger future volumes of data APDB should be based on a horizontally-scalable architecture, preferably with built-in high-availability mechanisms. One attractive option for such a backend was the NoSQL database Apache Cassandra.

Multiple tests were performed with a Cassandra-backed implementation (DMTN-156 [Salnikov, 2020], DMTN-184 [Salnikov, 2021]), which demonstrated reasonable performance and scaling behavior when using fast NVMe-based storage. Based on these tests it was projected that a Cassandra cluster with approximately 10 nodes would be able to handle the initial period of Rubin operations and would allow us to collect enough operational experience to plan possible future upgrades. The USDF has commissioned a 12-node Cassandra cluster with adequate storage, which is used today for further testing.

Most of the tests mentioned above exercised only the part of APDB functionality relevant to AP pipelines, using synthetic data generated by a very simple algorithm. For more realistic studies we will need to run the actual AP pipeline code at a scale close to the actual scale of operations with LSSTCam.

The data model in the Cassandra-based APDB is optimized for efficient queries by the AP pipeline and cannot support other types of queries efficiently. For those queries we need to transfer all APDB data into PPDB, which is assumed to be a standard PostgreSQL database. Initial ideas for replicating APDB data to PPDB are discussed in DMTN-268 [Salnikov, 2023]; an implementation of these ideas is currently in development.

In addition to serving AP pipelines and PPDB, APDB will also be the source of data for Solar System Processing (SSP) and will receive updates from that system during the daytime. How exactly APDB can satisfy the additional requirements from Solar System Processing is still under consideration. A meeting at JTM 2024 was dedicated to the discussion of those requirements, though no immediate decisions were made there. As the APDB data model is not optimal for non-AP queries, one possible approach for SSP is to use PPDB as the source of input data, but it is not clear whether PPDB can handle this reasonably with the current table schema.

Current System Architecture#

It should be clear from the above history that some parts of the current architecture are defined better than others. In particular, the interface between the AP pipeline and APDB is in reasonable shape and has been tested extensively, mostly with the SQL backend but recently also with the USDF Cassandra cluster. Other parts are reasonably well understood, but many aspects and details are not well defined at this point. One very important point about the whole system is that we have to make sure that the states of APDB and PPDB are eventually consistent. This implies that APDB and PPDB logically form a single distributed database, and the tools that manage data in APDB/PPDB have to guarantee the consistency of this database.

Fig. 1 and Fig. 2 depict current competing ideas about what the architecture could look like for a production system, reflecting two possible options for propagating updates from Solar System Processing. Some details about components of the architecture are discussed in the following sections.

_images/system-architecture1.svg

Fig. 1 High-level overview of one variant of APDB/PPDB architecture. Blue arrows represent data flow during observations, red arrows show daytime data movement. The daily update process receives updated records from SSP, e.g., as a set of files, writes them to regular APDB tables and also adds data to replica tables for propagation to PPDB.#

_images/system-architecture2.svg

Fig. 2 High-level overview of another variant of APDB/PPDB architecture. SSP updates PPDB directly, and a daily update process reads PPDB and writes only to regular APDB tables.#

AP pipeline#

The AP pipeline runs image analysis concurrently for each detector in a visit, which means there will be 189 tasks executing in parallel and accessing APDB. Processing of subsequent visits may overlap, because AP preloads data from APDB before the actual image data becomes available, and because some visits may take longer to process. This may result in an even higher number of concurrent clients trying to either read from or write to APDB.

One potential performance issue with pipeline tasks is that client connection establishment is a heavier process than it is for non-distributed services. For performance, clients need to communicate with every server in a Cassandra cluster, which involves a service-discovery phase in the connection protocol. Ideally, with a large number of clients and connections, clients would hold long-term connections to the cluster, but this is not possible with the AP pipelines. An additional complication is that the pipeline is constructed from individual tasks, and more than one task in a pipeline needs to access APDB.
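
As an illustration of the connection cost, the sketch below shows how a short-lived pipeline task might open a session with the Python Cassandra driver; the host names, keyspace, and policy choices are placeholders rather than the actual APDB configuration. Because each task is a separate process that exists only for the duration of one detector's processing, this cost is paid again on every run.

```python
# Minimal sketch of a pipeline task opening a Cassandra session; the host
# names, keyspace, and policies are placeholders, not the APDB configuration.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Token-aware routing sends each query to a replica owning the data, but it
# requires the driver to learn the full cluster topology when connecting.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
    request_timeout=30.0,
)

# Only a few contact points are listed; the driver discovers the remaining
# nodes and opens connections to all of them during this call.
cluster = Cluster(
    contact_points=["apdb-node01", "apdb-node02"],  # hypothetical hosts
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("apdb")  # keyspace name is an assumption

# ... per-detector reads and writes go here ...

cluster.shutdown()
```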

Cassandra APDB#

APDB’s main purpose is to serve AP pipeline queries in the most efficient way. To satisfy that goal, the schema of the main tables used by AP (DiaObject, DiaSource, DiaForcedSource) has been optimized (a schematic sketch follows the list below):

  • partitioning and clustering of these tables is defined to match the queries used by AP pipeline tasks,

  • a special table was added to the schema to keep the latest version of DiaObject records.
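
The following DDL illustrates the idea of this partitioning. It is a simplified, hypothetical sketch rather than the actual APDB schema, which is generated from the science data model and uses a spatial pixelization to define partitions; the apdb_part and apdb_time_part columns below stand in for that scheme, and the science columns are abbreviated.

```python
# Schematic, hypothetical DDL illustrating the partitioning idea; this is NOT
# the actual APDB table definitions.
DIA_OBJECT_LAST_DDL = """
CREATE TABLE IF NOT EXISTS apdb.DiaObjectLast (
    apdb_part int,       -- spatial partition (stand-in for a pixel index)
    diaObjectId bigint,  -- clustering column: one latest row per object
    ra double,
    dec double,
    -- (additional science columns omitted)
    PRIMARY KEY (apdb_part, diaObjectId)
);
"""

DIA_SOURCE_DDL = """
CREATE TABLE IF NOT EXISTS apdb.DiaSource (
    apdb_part int,        -- spatial partition matching pipeline regions
    apdb_time_part int,   -- temporal partition, enables 12-month cleanup
    diaSourceId bigint,
    diaObjectId bigint,
    midpointMjdTai double,
    -- (additional science columns omitted)
    PRIMARY KEY ((apdb_part, apdb_time_part), diaSourceId)
);
"""
```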

To avoid unconstrained growth of the data in the DiaSource and DiaForcedSource tables, we plan to periodically remove data that falls outside a 12-month window.

APDB also serves as the source of data for the downstream PPDB, but the optimized schema of the main tables cannot efficiently serve the queries used to copy recently added data. To facilitate an efficient transfer of recently produced data to PPDB, an additional set of replica chunk tables is used, whose partitioning and clustering are better suited to transfers to PPDB. When the AP pipeline writes data to APDB, it duplicates the data by writing to the regular tables and optionally into the replica chunk tables. If replication tables are enabled, APDB has an additional option to skip writing the regular DiaObject table, which is not used by the AP pipeline (AP uses the table with the latest versions of DiaObjects); this saves storage space.
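
A minimal sketch of this dual-write path is shown below. The table names, column names, and helper functions are hypothetical; the real write logic lives in the APDB implementation and differs in detail.

```python
# Sketch of the dual-write idea with hypothetical table names and helpers.
import time

CHUNK_SECONDS = 600  # 10-minute replica chunk window


def replica_chunk_id(unix_time: float) -> int:
    """Map a timestamp to its replica chunk identifier (placeholder scheme)."""
    return int(unix_time) // CHUNK_SECONDS


def time_partition(unix_time: float, days: int = 30) -> int:
    """Map a timestamp to a temporal partition (placeholder scheme)."""
    return int(unix_time) // (days * 24 * 3600)


def store_dia_sources(session, sources, enable_replica: bool = True) -> None:
    """Write DiaSource records to the regular table and, optionally,
    duplicate them into the replica chunk table."""
    now = time.time()
    chunk = replica_chunk_id(now)
    tpart = time_partition(now)
    for src in sources:
        # Regular table, partitioned for AP pipeline region queries.
        session.execute(
            "INSERT INTO apdb.DiaSource"
            " (apdb_part, apdb_time_part, diaSourceId, diaObjectId)"
            " VALUES (%s, %s, %s, %s)",
            (src["apdb_part"], tpart, src["diaSourceId"], src["diaObjectId"]),
        )
        if enable_replica:
            # Replica chunk table, partitioned by chunk id so that a complete
            # chunk can later be read and shipped to PPDB in one pass.
            session.execute(
                "INSERT INTO apdb.DiaSourceChunks"
                " (apdb_replica_chunk, diaSourceId, diaObjectId)"
                " VALUES (%s, %s, %s)",
                (chunk, src["diaSourceId"], src["diaObjectId"]),
            )
```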

Replication process#

A separate replication service is responsible for moving recently added APDB data to PPDB. APDB writes into its replication tables, which are partitioned by a chunk window, typically 10 minutes long. When a chunk is complete it can be transferred to PPDB.

The replication process watches APDB for its list of existing chunks and decides when a new chunk can be transferred to PPDB. It also keeps a list of already transferred chunks in a separate table in PPDB.

Replica chunks that have been transferred to PPDB can be deleted from APDB. The same replication process is responsible for this cleanup, which happens with some delay (usually a few days).
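
The sketch below summarizes one cycle of this process under the assumptions above; `apdb` and `ppdb` stand for hypothetical interfaces, and every method name is a placeholder rather than part of the real APDB/PPDB APIs.

```python
# Simplified sketch of one replication/cleanup cycle with placeholder APIs.
import time

CLEANUP_DELAY = 3 * 24 * 3600  # keep transferred chunks in APDB for ~3 days


def replication_cycle(apdb, ppdb) -> None:
    """One pass of the replication service."""
    apdb_chunk_ids = set(apdb.list_replica_chunk_ids())  # chunks still in APDB
    copied = dict(ppdb.list_copied_chunks())             # {chunk_id: copy_time}

    # Transfer any complete chunk that PPDB has not recorded yet.
    for chunk_id in sorted(apdb_chunk_ids - copied.keys()):
        if apdb.chunk_is_complete(chunk_id):
            rows = apdb.read_chunk(chunk_id)
            ppdb.store_chunk(chunk_id, rows)        # insert rows into PPDB
            ppdb.record_copied_chunk(chunk_id)      # update tracking table

    # Delete chunks from APDB only some time after they have reached PPDB.
    now = time.time()
    for chunk_id, copy_time in copied.items():
        if chunk_id in apdb_chunk_ids and now - copy_time > CLEANUP_DELAY:
            apdb.delete_chunk(chunk_id)
```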

PPDB#

There has not been much discussion of, or any decision about, what exactly PPDB is going to be, but the general assumption is that it will be a relational database, most likely PostgreSQL, so that it can support arbitrary queries. The amount of data in PPDB will be very large and ever-growing as more data is generated by the AP pipeline. It is unlikely that a single PostgreSQL server can handle arbitrary requests from multiple clients at the data volume that PPDB is supposed to contain.

A separate aspect of having a very large SQL database is the handling of schema changes. It is reasonable to expect that the database schema for both APDB and PPDB will evolve over the lifetime of the Rubin survey. If a schema upgrade involves data updates (e.g., filling new columns with non-default data), then the upgrade process can take a very long time. Such upgrades will require very careful planning to avoid, if possible, changes that would result in extended downtime. Relatedly, Cassandra supports a more limited set of schema changes than SQL databases, which can impose additional constraints on future schema evolution.
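
As a hedged illustration of why data-bearing upgrades are slow, the sketch below adds a new column to a very large PostgreSQL table and backfills it in batches; the table, column, and connection string are placeholders. Adding a nullable column without a default is a cheap metadata-only change, but the backfill touches every row and has to be broken into short transactions to avoid long locks.

```python
# Illustration of a data-bearing schema upgrade on a large PostgreSQL table;
# table, column, and DSN are hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://ppdb-host/ppdb")  # hypothetical DSN

# Metadata-only change: fast even on a very large table.
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE DiaSource ADD COLUMN new_flag integer"))

# Backfill in short transactions; on a table with billions of rows this loop
# can still run for a very long time.
BATCH = 100_000
while True:
    with engine.begin() as conn:
        updated = conn.execute(
            text(
                "UPDATE DiaSource SET new_flag = 0 WHERE ctid IN ("
                "  SELECT ctid FROM DiaSource WHERE new_flag IS NULL LIMIT :n)"
            ),
            {"n": BATCH},
        ).rowcount
    if updated == 0:
        break
```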

Solar System Processing#

SSP needs a list of “unassociated” DIASources from the last two weeks (possibly up to four weeks) as an input to its daily processing job. The query to find all such records cannot be reasonably run on the regular DIA tables in Cassandra as it implies a whole-sky scan which will be too slow for realistic response times. One option for running these sorts of queries is to use PPDB, though that will likely require additional special indices on the DiaObject table. It is not very clear if PPDB can handle these sorts of queries without doing full table scans and some research will be needed to understand that.
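
For illustration, a query of this kind on PPDB might look like the sketch below; the column names only approximate the current schema, and the partial index shown is an assumption about what would be needed to avoid a full table scan rather than a decided design.

```python
# Sketch of the SSP input query against PPDB; names and values are illustrative.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://ppdb-host/ppdb")  # hypothetical DSN

with engine.begin() as conn:
    # A partial index restricted to unassociated sources stays relatively
    # small even though the DiaSource table itself keeps growing.
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS diasource_unassoc_idx "
        "ON DiaSource (midpointMjdTai) WHERE ssObjectId IS NULL"
    ))

with engine.connect() as conn:
    # All DiaSources from the last two weeks that are not yet associated with
    # a known solar system object; the MJD cutoff value is illustrative.
    rows = conn.execute(
        text(
            "SELECT diaSourceId, ra, decl, midpointMjdTai FROM DiaSource "
            "WHERE ssObjectId IS NULL AND midpointMjdTai > :cutoff"
        ),
        {"cutoff": 60700.0 - 14.0},
    ).fetchall()
```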

A possible alternative would be to extend the schema, either in the Cassandra APDB or in PPDB, with additional tables optimized for SSP queries. These tables could be populated or cleared when AP saves new data to APDB; if the volume of data in those tables is limited, the impact on APDB performance may be negligible. Further research is necessary to understand which of these options works better at scale.

The output of SSP is a set of records for the MPCORB, SSObject, and SSSource tables. The contents of the first two tables are completely re-computed every day and may contain about a million records at the start of the survey, growing to a few million by its end. SSSource records produced by SSP are added to the existing records already in the database.

Daily APDB updates#

Before the start of the next night’s observing the contents of APDB and PPDB need to be updated with the results of daily processing. The updates include:

  • replacing the MPCORB and SSObject tables,

  • inserting new records into the SSSource table,

  • re-associating some DIASources with SSObjects,

  • possibly removing some DIAObjects,

  • withdrawing some DIAForcedSources.

It is likely that the SSSource table is not needed for AP pipeline queries; in that case APDB does not need to include it. Some update operations in Cassandra cannot be implemented trivially and may need additional tables to support them.
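
As an example of a non-trivial update, the sketch below re-associates a single DiaSource with an SSObject. Because a Cassandra row can only be addressed by its full primary key, a hypothetical lookup table mapping diaSourceId to its partition is assumed here; all table and column names are illustrative and follow the simplified schema sketched earlier.

```python
# Sketch of re-associating one DiaSource with an SSObject in Cassandra;
# the lookup table and all names are hypothetical.
def reassociate_source(session, dia_source_id: int, ss_object_id: int) -> None:
    # Hypothetical lookup table: diaSourceId -> (apdb_part, apdb_time_part).
    row = session.execute(
        "SELECT apdb_part, apdb_time_part FROM apdb.DiaSourceToPartition"
        " WHERE diaSourceId = %s",
        (dia_source_id,),
    ).one()
    if row is None:
        raise LookupError(f"unknown diaSourceId {dia_source_id}")

    # Only non-key columns can be modified; the partitioning and clustering
    # columns of an existing row can never be changed in place.
    session.execute(
        "UPDATE apdb.DiaSource SET ssObjectId = %s"
        " WHERE apdb_part = %s AND apdb_time_part = %s AND diaSourceId = %s",
        (ss_object_id, row.apdb_part, row.apdb_time_part, dia_source_id),
    )
```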

There are two possible options for performing daily updates:

  • send all updates to APDB first and then propagate them to PPDB via the regular replication process,

  • store all updates in PPDB, then query PPDB and propagate them to APDB.

There are pros and cons to each of these approaches. In the first case APDB becomes the only authoritative source of all information and PPDB is updated by a single agent. This should simplify the PPDB implementation, as it will not need any additional structures to track updates that need to be sent to APDB later. SSP and other actors could potentially send their updates directly to APDB using an extended APDB API, avoiding intermediate agents. On the other hand, this option makes the replication process more complex, as it has to handle different types of updates. The second option has the advantage of reducing the need to track updates in Cassandra, which may be beneficial for its performance; but it also moves part of the update-handling logic to PPDB, changing its role from a passive data receiver to a more complex data source.

Unresolved issues#

While there has been significant progress with the APDB implementation and with recent tests with Cassandra, there are still a number of issues that are not fully understood:

APDB scalability

All scalability tests so far have been performed with a synthetic workload using simulated data. For a better estimate of APDB performance it should be tested with a more realistic payload from the actual AP pipeline. Several months’ worth of data, and optimally a year if possible, is needed for a reasonable estimate.

Cassandra operations

Cassandra has multiple high-availability features that are critical for uninterrupted AP operations. While these features are normally transparent to clients, the operational aspect of cluster management needs to be understood and tested using various possible failure scenarios.

Replication service

There exists an initial implementation of the replication process for moving data from APDB to PPDB, which can run as a command-line application. This implementation will have to be extended to cover the different replication options mentioned above. Additionally, it needs to be turned into a configurable service that can run as part of an APDB/PPDB deployment.

PPDB scalability

PPDB will contain a huge volume of data. While the size of the storage itself may not be a problem, the performance of queries at such a scale will definitely be an issue. It is very likely that a single PostgreSQL server is not going to be adequate to serve all science users. A distributed SQL database, possibly based on PostgreSQL, may provide better performance, though distributed SQL databases usually have some limitations in the SQL features they support.

SSP and daily updates

There has been little progress in understanding the SSP-APDB interface since the initial discussion of SSP needs. Significant effort is needed to work out the exact requirements from the SSP side and to translate those requirements into a reasonable design. The SSP-APDB interface may also affect the choice between the two options for implementing daily updates mentioned above.

References#

[1] Andy Salnikov. Performance of RDBMS-based PPDB implementation. March 2019. Vera C. Rubin Observatory Data Management Technical Note DMTN-113. URL: https://dmtn-113.lsst.io/.

[2] Andy Salnikov. Performance of Cassandra-based APDB implementation. July 2020. Vera C. Rubin Observatory Data Management Technical Note DMTN-156. URL: https://dmtn-156.lsst.io/.

[3] Andy Salnikov. Testing Cassandra APDB implementation on GCP. June 2021. Vera C. Rubin Observatory Data Management Technical Note DMTN-184. URL: https://dmtn-184.lsst.io/.

[4] Andy Salnikov. Data replication between APDB and PPDB. August 2023. Vera C. Rubin Observatory Data Management Technical Note DMTN-268. URL: https://dmtn-268.lsst.io/.