Transaction and ACID Enforcement
Storage Requirements
Mutual Exclusion of File Creation
A storage supports mutual exclusion of file creation when only one writer wins if there are multiple writers trying to write to the same new file. This is the key feature that TrinityLake relies on for enforcing ACID semantics during the commit process.
This feature is widely available in most storage systems, for examples:
- On Linux File System through O_EXCL
- On Hadoop Distributed File System through atomic rename
- On Amazon S3 through IF-NONE-MATCH
- On Amazon DynamoDB through conditional PutItem
- On Google Cloud Storage through IF-NONE-MATCH
- On Azure Data Lake Storage through IF-NONE-MATCH
Consistency
The Consistency aspect of ACID is enforced in the storage system and out of the control of the TrinityLake format. The TrinityLake format assumes that you are using a storage system that is Strongly Consistent, i.e. all data operations are processed and reflected in a consistent order across a distributed system. For example, the TrinityLake format would not work as expected if you use it on eventually consistent systems like Apache Cassandra.
Durability
The Durability aspect of ACID is enforced in the storage system and out of the control of the TrinityLake format. For example, if you use the TrinityLake format on Amazon S3, you get 99.999999999% (11 9's) durability guarantee.
Immutable Copy-on-Write (CoW)
Modifying a TrinityLake tree means modifying the content of the existing node files and creating new node files. This modification process is Copy-on-Write (CoW), because "modifying the content" entails reading the existing content of the node file, and rewriting a completely new node file that contains potentially parts of the existing content plus the updated content.
When performing CoW, the node files are required to be created at a new file location, rather than overwriting an existing file. This means all node files are immutable once written until deletion.
Root Node File Name
With CoW, the root node file name is important because every change to the tree would create a new root node file, and the root node file name can be used essentially as the version of the tree.
TrinityLake defines that each root node has a numeric version number,
and the root node is stored in a file name _<version_number_binary_reversed>.ipc
.
The file name is persisted in storage as is without optimization.
For example, the 100th version of the root node file would be stored with name _00100110000000000000000000000000.ipc
.
Root Node Latest Version Hint File
A file with name _latest_hint
is stored and marks the hint to the latest version of the TrinityLake tree root node file.
The file name is persisted in storage as is without optimization
The file contains a number that marks the presumably latest version of the tree root node, such as 100
.
Read Isolation
Here we discuss how the Isolation aspect of ACID is enforced in TrinityLake.
A transaction, either for read or write or both, will always start with identifying the version of the TrinityLake tree to look into. This is determined by:
- Reading the version
_latest_hint
file if the file exists, or start from version 0 - Try to get files of increasing version number until the version
k
that receives a file not found error - The version
k-1
will be the one to decide the root node file name
All the object definition resolutions within the specific transaction must happen using that version of the TrinityLake tree.
Commit Atomicity
Here we discuss how the Atomicity aspect of ACID is enforced in TrinityLake.
When committing a transaction, the writer does the following:
- Apply changes and write all non-root node files
- Try to write to the root node file in the targeted root node file name
- If succeeded, the commit has succeeded, write the
_latest_hint
file with the new version with best effort. - If failed, the commit has failed, and the process will decide the best way to re-apply the changes, which can be either re-applying the metadata change against the new tree, or redo the entire operation, or anything in between depending on the implementation of the format.
Note
At step 3, the write of the _latest_hint
file is not guaranteed to exist or be accurate.
For example, if two processes A and B commit sequentially at version 2 and 3, but A wrote the hint slower than B,
the hint file will be incorrect with value 2. This is why in the Read Isolation section
we explicitly try to seek for further versions instead of just trust the value in the hint file.
Time Travel
Because the TriniyLake tree node is versioned, time travel against the tree root, i.e. time travel against the entire Trinity Lakehouse, is possible.
The engines should match the time travel ANSI-SQL semantics in the following way:
FOR SYSTEM_TIME AS OF
The timestamp-based time travel can be achieved by continuously tracing the previous root node key to older root nodes, and check the creation timestamp key until the right root node for time travel is found.
FOR SYSTEM_VERSION AS OF
When the system version is a numeric value, it should map to the version of the tree root node. The root node of the specific version can directly be found based on the root node file name.
When the system version is a string that does not resemble a numeric value, it should map to a possible exported snapshot.
Rollback Committed Version
TrinityLake uses the roll forward technique for rolling back any committed version.
If the current latest root node version is v
, and a user would like to rollback to version v-1
,
Rollback is performed by committing a new root node with version v+1
which is most identical to the root node file v-1
,
with the difference that the root node v
should be recorded as the rollback root node key.
Snapshot Export
A snapshot export for a Trinity Lakehouse means to export a specific version of the TrinityLake tree root node, and all the files that are reachable through that root node.
Every time an export is created, the Lakehouse definition should be updated to record the name of the export and the root node file that the export is at.
There are many types of export that can be achieved, because the export process can decide to stop replication at any level of the tree and call it an export. At one extreme, a process can replicate any reachable files starting at the root node. We call this a Full Export. On the other side, a process can simply replicate the specific version of tree root node, and all other files reachable from the root node are not replicated. We call this a Minimal Export. We call any export that is in between a Partial Export.
Any file that is referenced by both the exported snapshot and the source Lakehouse might be removed by the Lakehouse version expiration process. With a full snapshot export, all files are replicated and dereferenced from the source Lakehouse. With a partial or minimal export, additional retention policy settings are required to make sure the version expiration process still keep those files available for a certain amount of time.