General architecture

Towards aiding the development of software that retrieves data from these new databases, we provide the FORWARD middleware query processor (see figure on the left).

Conceptually, each database appears to the client as a set of SQL++ virtual views.

In the simplest case, the client issues a SQL++ query over the data of a single database, and FORWARD translates this into the underlying database's native query language. In other more complex cases, given a SQL++ query that is not directly supported by the underlying database, FORWARD will decompose the SQL++ query into one or more native queries that are supported. Subsequently, FORWARD will combine the query results and compensate in the middleware for any semantics or capabilities discrepancies between SQL++ and the underlying database. Finally, in yet more complex cases, the client issues a SQL++ query that integrates the data of two or more databases, the contemporary use case being a SQL database and a non-SQL database.


The figure on the right shows an example of how the FORWARD middleware evaluates queries over different databases with varying capabilities. Consider a PostgreSQL database and a MongoDB database, where the PostgreSQL database contains a sensors table and MongoDB contains a measurements array of JSON objects. The FORWARD query processor presents to its clients the virtual SQL++ views V1 and V2 of these databases. Notice that virtual view V2 of MongoDB is identical to its native data representation. Since the views are virtual, FORWARD does not have a copy of the source data.

Suppose the client issues federated query Q, which finds the average temperature reported by any reliably functioning sensor in a specific lat-long bounding box, where a sensor is deemed reliable only if none of its measurements are outside the range -40°F to 140°F. The query is decomposed into PostgreSQL and MongoDB subqueries that are efficient and compatible with the limited query capabilities of MongoDB. In particular, FORWARD first issues to PostgreSQL the query Q1 that finds the ids of the sensors in the bounding box. Then, for each id, FORWARD issues to MongoDB the query Q2 that tests whether the sensor is reliable and, if it is, it issues a second query Q3 that finds the average of the temperature measurements. Notice that if MongoDB had supported nested queries, it would have been possible to issue a single MongoDB query for each id. Finally, the coord_to_state() function, which inputs coordinates and outputs the name of the corresponding state, is executed in the middleware.

Notice that delivering such a query plan requires a query optimizer which can decide that it is beneficial to first find the small set of sensor ids within the given bounding box, and then proceed to find measurements from MongoDB. Most interestingly, this optimizer must be aware of the limited query capabilities of the involved databases, so that the subqueries sent to a database are compatible with its capabilities.