Summary
- State compression on Solana is primarily used for compressed NFTs (cNFTs), but it can be applied to any data type
- State Compression lowers the amount of data you have to store onchain using Merkle trees.
- A Merkle tree compresses data by hashing pairs of data repeatedly until a single root hash is produced. This root hash is then stored onchain.
- Each leaf on a Merkle tree is a hash of that leaf’s data.
- A concurrent Merkle tree is a specialized version of a Merkle tree. Unlike a standard Merkle tree, it allows multiple updates simultaneously without affecting transaction validity.
- Data in a state-compressed program is not stored onchain. So you have to use indexers to keep an offchain cache of the data. It’s this offchain cache data that is used to then verify against the onchain Merkle tree.
Lesson
Previously, we talked about state compression in the context of compressed NFTs.
While compressed NFTs are the main use case for state compression, you can apply state compression to any Solana program. In this lesson, we’ll discuss state compression in general terms so you can use it across your Solana projects.
A theoretical overview of state compression
Normally, data in Solana programs is serialized (usually with borsh) and stored directly in an account. This makes it easy to read and write the data through the program. The account data is trustworthy because only the program can modify it.
However, if all we need is to verify the integrity of the data, we don’t have to store the actual data onchain. Instead, we can store hashes of the data, which can be used to prove or verify its accuracy. This is called state compression.
These hashes take up far less storage space than the original data. The full data can be stored in a cheaper, offchain location, and only needs to be verified against the onchain hash when accessed.
The Solana State Compression Program uses a structure known as a concurrent Merkle tree. A concurrent Merkle tree is a special kind of binary tree that deterministically hashes data, i.e. the same inputs will always produce the same Merkle root.
The final hash, called a Merkle root, is significantly smaller in size than all the original full data sets combined. This is why it’s called "compression". And it’s this hash that’s stored onchain.
Outlined below are the steps to this process, in order:
- Take a piece of data.
- Create a hash of that data.
- Store the hash as a "leaf" at the bottom of the tree.
- Hash pairs of leaves together to create branches.
- Hash pairs of branches together.
- Repeat this process until you reach the top of the tree.
- The top of the tree contains a final "root hash."
- Store this root hash onchain as proof of the data.
- To verify the data, recompute the hashes and compare the final hash to the onchain root hash.
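The hashing steps above can be sketched in a few lines of TypeScript. This is a toy illustration using SHA-256 and a plain binary tree, not the onchain implementation (which uses a keccak-based hash inside a concurrent tree):

```typescript
import { createHash } from "node:crypto";

// Hash two child nodes into their parent.
function hashPair(left: Buffer, right: Buffer): Buffer {
  return createHash("sha256").update(Buffer.concat([left, right])).digest();
}

// Build a Merkle root: hash each datum into a leaf, then hash pairs upward.
function merkleRoot(data: string[]): Buffer {
  let level = data.map((d) => createHash("sha256").update(d).digest());
  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      // Duplicate the last node when the level has an odd count.
      next.push(hashPair(level[i], level[i + 1] ?? level[i]));
    }
    level = next;
  }
  return level[0]; // this root hash is what would be stored onchain
}

// Verification is just recomputation and comparison against the stored root.
const storedRoot = merkleRoot(["a", "b", "c", "d"]);
const verified = storedRoot.equals(merkleRoot(["a", "b", "c", "d"]));
```

Changing any one of the leaves produces a different root, which is why the single onchain hash is enough to detect tampering.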
This method comes with some trade-offs:
- The data isn’t stored onchain, so it’s harder to access.
- Developers must decide how often to verify the data against the onchain hash.
- If the data changes, the entire data set must be sent to the program, along with the new data. You’ll also need proof that the data matches the hash.
These considerations will guide you when deciding whether, when, and how to implement state compression in your programs. With that quick overview, let’s go into more technical detail.
Concurrent Merkle trees
Since a Merkle tree is represented onchain as a single root hash, any change to a leaf node alters that root hash. This becomes problematic when multiple transactions in the same slot try to update leaf data. Since transactions are executed serially (one after the other), all but the first will fail, because the root hash and proof passed in will have been invalidated by the first transaction executed.
In short, a standard Merkle tree can only handle one leaf update per slot. This significantly limits the throughput in a state-compressed program that depends on a single Merkle tree for its state.
Thankfully, this issue can be addressed using a concurrent Merkle tree. Unlike a regular Merkle tree, a concurrent Merkle tree keeps a secure changelog of recent updates, along with their root hash and the proof needed to derive it. When multiple transactions in the same slot attempt to modify leaf data, the changelog serves as a reference, enabling concurrent updates to the tree.
How does the concurrent Merkle tree achieve this? In a standard Merkle tree, only the root hash is stored. However, a concurrent Merkle tree includes extra data that ensures subsequent writes can succeed.
This includes:
- The root hash - The same root hash found in a regular Merkle tree.
- A changelog buffer - A buffer containing proof data for recent root hash changes, allowing further writes in the same slot to succeed.
- A canopy - To update a specific leaf, you need the entire proof path from the leaf to the root hash. The canopy stores intermediate proof nodes along this path so that not all of them need to be sent from the client to the program.
Key Parameters for Configuring a Concurrent Merkle Tree
As a developer, you are responsible for controlling three key parameters that directly affect the tree’s size, cost, and the number of concurrent changes it can handle:
- Max Depth
- Max Buffer Size
- Canopy Depth
Let’s take a brief overview of each parameter.
Max Depth
The max depth determines how many levels or "hops" are required to reach the root of the tree from any leaf. Since Merkle trees are structured as binary trees, where each leaf is paired with only one other leaf, the max depth can be used to calculate the number of leaf nodes, and therefore the maximum number of data entries, in the tree with the formula `2^maxDepth`.
Here’s a quick TypeScript function for illustration:
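A possible version of that function (a sketch with illustrative names, not code from an official SDK):

```typescript
// Number of leaves (data entries) a tree of a given max depth can hold.
function maxLeaves(maxDepth: number): number {
  return 2 ** maxDepth;
}

// A depth of 3 gives 8 leaves; a depth of 20 gives 1,048,576 leaves.
```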
A max depth of 20 would allow for over one million leaves, making it suitable for storing large datasets like NFTs.
Max Buffer Size
The max buffer size controls how many concurrent updates can be made to the tree within a single slot while keeping the root hash valid. In a standard Merkle tree, only the first transaction in a slot would be successful since it updates the root hash, causing all subsequent transactions to fail due to hash mismatches. However, in a concurrent Merkle tree, the buffer maintains a log of changes, allowing multiple transactions to update the tree simultaneously by checking the appropriate root hash from the buffer. A larger buffer size increases throughput by enabling more concurrent changes.
Canopy Depth
The canopy depth specifies how many proof nodes are stored onchain for any given proof path. To verify any leaf in the tree, you need a complete proof path, which includes one proof node for every layer of the tree. For a tree with a max depth of 14, there will be 14 proof nodes in total. Each proof node adds 32 bytes to the transaction, and without careful management, large trees could exceed the transaction size limit.
Storing more proof nodes onchain (i.e., having a deeper canopy) allows other programs to interact with your tree without exceeding transaction limits, but it also uses more onchain storage. Consider the complexity of interactions with your tree when deciding on an appropriate canopy depth.
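As a rough sketch of that arithmetic (the function name is illustrative, not from an SDK): each proof node is a 32-byte hash, and the canopy lets the client omit the nodes closest to the root.

```typescript
// Bytes of proof data a client must send for one leaf, given the tree's
// max depth and how many levels of proof nodes are cached onchain (canopy).
function clientProofBytes(maxDepth: number, canopyDepth: number): number {
  const nodesFromClient = maxDepth - canopyDepth;
  return nodesFromClient * 32; // 32 bytes per proof node
}

// A depth-14 tree with no canopy: 14 * 32 = 448 bytes of proof per update.
// Adding a canopy of depth 10 cuts that to 4 * 32 = 128 bytes.
```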
Balancing Trade-offs
These three values—max depth, max buffer size, and canopy depth—all come with trade-offs. Increasing any of them will enlarge the account used to store the tree, raising the cost of creating the tree.
- Max Depth: This is straightforward to determine based on how much data needs to be stored. For example, if you need to store 1 million compressed NFTs (cNFTs), where each cNFT is a leaf, you would need a max depth of 20 (`2^20 > 1 million`).
- Max Buffer Size: The choice of buffer size is mainly a question of throughput: how many concurrent updates are required? A larger buffer allows for more updates in the same slot.
- Canopy Depth: A deeper canopy improves composability, enabling other programs to interact with your state-compressed program without exceeding transaction size limits. Omitting the canopy is discouraged, as it could cause issues with transaction size, especially when other programs are involved.
Data Access in a State-Compressed Program
In a state-compressed program, the actual data isn’t stored directly onchain. Instead, the concurrent Merkle tree structure is stored, while the raw data resides in the blockchain’s more affordable ledger state. This makes accessing the data more challenging, but not impossible.
The Solana ledger is essentially a list of entries containing signed transactions which, in theory, can be traced all the way back to the Genesis block. This means any data that has ever been included in a transaction is stored in the ledger.
Since the state compression process happens onchain, all the data is still in the ledger state. In theory, you could retrieve the original data by replaying the entire chain state from the start. However, it’s far more practical (though still somewhat complex) to use an indexer to track and index the data as the transactions happen. This creates an offchain "cache" of the data that can be easily accessed and verified against the onchain root hash.
While this process may seem complex at first, it becomes clearer with practice.
State Compression Tooling
While understanding the theory behind state compression is crucial, you don’t have to build it all from scratch. Talented engineers have already developed essential tools like the SPL State Compression Program and the Noop Program to simplify the process.
SPL State Compression and Noop Programs
The SPL State Compression Program is designed to streamline and standardize the creation and management of concurrent Merkle trees across the Solana ecosystem. It provides Instruction Handlers for initializing Merkle trees, handling tree leaves (such as adding, updating, or removing data), and verifying the integrity of leaf data.
Additionally, the State Compression Program works in conjunction with a separate "Noop" program. A no-op program does nothing (literally "no operation"). The Solana Noop Program simply logs data to the ledger state, but that logging is essential to state compression:
When you store compressed data, it’s passed to the State Compression Program, which hashes the data and emits it as an "event" to the Noop Program. While the hash is stored in the concurrent Merkle tree, the raw data can still be accessed via the Noop Program’s transaction logs.
Indexing Data for Easy Lookup
Typically, accessing onchain data is as simple as fetching the relevant account. However, with state compression, it’s not that straightforward.
As mentioned earlier, the data now resides in the ledger state rather than in an account. The most accessible place to find the complete data is in the logs of the Noop instruction. While this data remains in the ledger state indefinitely, it may become inaccessible through validators after a certain period.
Validators don’t store all transactions back to the Genesis block to save space and improve performance. The length of time you can access Noop instruction logs varies depending on the validator. Eventually, the logs will become unavailable if you’re relying on direct access to them.
In theory, it’s possible to replay transaction states back to the genesis block, but this approach is impractical for most teams and isn’t efficient. Some RPC providers have adopted the Digital Asset Standard (DAS) to enable efficient querying of compressed NFTs and other assets. However, as of now, DAS does not support arbitrary state compression.
You essentially have two main options:
- Use an indexing provider to create a custom indexing solution for your program, which will monitor the events sent to the Noop program and store the relevant data offchain.
- Build your own indexing solution that stores transaction data offchain.
For many dApps, option 2 can be a practical choice. Larger-scale applications, however, may need to rely on infrastructure providers to manage their indexing needs.
State Compression Development Process
Create Rust Types
In a typical Anchor program, developers often start by defining the Rust types that represent accounts. For a state-compressed program, however, the focus shifts to defining types that align with the Merkle tree structure.
In state compression, your onchain account will primarily store the Merkle tree. The more practical data will be serialized and logged to the Noop program for easier access and management. Your Rust types should encompass all data stored in the leaf nodes and any contextual information necessary for interpreting that data. For instance, if you’re developing a simple messaging program, your `Message` struct might look something like this:
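The original snippet isn’t reproduced here, but a stand-in sketch could look like the following. Everything in it is illustrative: in a real Anchor program, `sender` and `receiver` would be `Pubkey` values and both structs would derive `AnchorSerialize` so they can be logged to the Noop program; `[u8; 32]` stands in for `Pubkey` so the sketch is self-contained.

```rust
// Hypothetical types for a simple messaging program (sketch only).
#[derive(Clone, Debug, PartialEq)]
pub struct Message {
    pub sender: [u8; 32],   // stand-in for Pubkey: who sent the message
    pub receiver: [u8; 32], // stand-in for Pubkey: who it is addressed to
    pub message: String,    // the message text hashed into the leaf
}

#[derive(Clone, Debug, PartialEq)]
pub struct MessageLog {
    pub leaf_node: [u8; 32], // the hash appended to the Merkle tree
    pub message: Message,    // the full, uncompressed message
}

impl MessageLog {
    pub fn new(leaf_node: [u8; 32], message: Message) -> Self {
        MessageLog { leaf_node, message }
    }
}
```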
To be absolutely clear, the `MessageLog` is not an account you will read from. Instead, your program will create an instance of `MessageLog` using inputs from the instruction handler, rather than constructing it from data read from an account. We will cover how to read data from compressed accounts later.
Initialize a New Tree
To set up a new Merkle tree, clients need to perform two distinct steps.
- First, they allocate the account by calling the System Program.
- Next, they use a custom program to initialize the new account. This initialization involves setting the maximum depth and buffer size for the Merkle tree.
The initialization instruction handler must make a CPI (Cross-Program Invocation) to the `init_empty_merkle_tree` instruction from the State Compression Program, passing the maximum depth and buffer size as arguments.
- Max depth: Defines the maximum number of hops needed to travel from any leaf to the root of the tree.
- Max buffer size: Specifies the space allocated for storing a changelog of tree updates. This changelog is essential for supporting concurrent updates within the same block.
For instance, if you are initializing a tree to store messages between users, your Instruction Handler might look like this:
Adding Hashes to the Tree
Once the Merkle tree is initialized, you can begin adding data hashes to it.
This process involves passing the uncompressed data to an Instruction handler
within your program, which will hash the data, log it to the Noop Program, and
then use the State Compression Program’s append
instruction to add the hash to
the tree. Here’s how the Instruction Handler operates in detail:
- Hash the Data: Use the `hashv` function from the `keccak` crate to hash the data. It’s recommended to include the data owner or authority in the hash to ensure that only the proper authority can modify it.
- Log the Data: Create a log object representing the data you want to log to the Noop Program. Then, call `wrap_application_data_v1` to issue a CPI (Cross-Program Invocation) to the Noop Program with this object. This makes the uncompressed data easily accessible to any client, such as indexers, that may need it. You could also develop a custom client to observe and index data for your application specifically.
- Append the Hash: Construct and issue a CPI to the State Compression Program’s `append` instruction. This will take the hash generated in step 1 and append it to the next available leaf on the Merkle tree. As with previous steps, this requires the Merkle tree address and tree authority bump as signature seeds.
When applied to a messaging system, the resulting implementation might look like this:
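The full Anchor implementation is not shown here, but the three-step flow can be illustrated with a self-contained simulation. Everything below is a stand-in: a standard-library hasher replaces `keccak::hashv`, and plain `Vec`s replace the Noop CPI log and the Merkle tree. None of this is the real onchain code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy hash standing in for keccak::hashv (NOT the real onchain hash).
fn leaf_hash(owner: &str, message: &str) -> u64 {
    let mut h = DefaultHasher::new();
    owner.hash(&mut h);   // include the authority so only they can update later
    message.hash(&mut h); // include the data itself
    h.finish()
}

struct SimulatedProgram {
    noop_log: Vec<(u64, String)>, // stands in for Noop program CPI logs
    leaves: Vec<u64>,             // stands in for the Merkle tree's leaves
}

impl SimulatedProgram {
    fn append_message(&mut self, owner: &str, message: &str) -> u64 {
        // 1. Hash the data (authority + message)
        let hash = leaf_hash(owner, message);
        // 2. "Log" the uncompressed data for offchain indexers
        self.noop_log.push((hash, message.to_string()));
        // 3. Append the hash to the tree
        self.leaves.push(hash);
        hash
    }
}
```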
Updating Hashes
To update a leaf in a Merkle tree, you’ll need to generate a new hash to replace the existing one. This process requires four key inputs:
- The index of the leaf you wish to update
- The root hash of the Merkle tree
- The original data you want to modify
- The updated data
Using these inputs, you can follow a series of steps similar to those used when initially appending data to the tree:
- Verify Update Authority: The first step, unique to updates, is to verify the authority of the entity making the update. This generally involves checking that the signer of the `update` transaction is indeed the owner or authority of the leaf at the specified index. Since the data in the leaf is hashed, you can’t directly compare the authority’s public key to a stored value. Instead, compute the previous hash using the old data and the `authority` listed in the account validation struct. Then, invoke a CPI to the State Compression Program’s `verify_leaf` instruction to confirm the hash matches.
- Hash the New Data: This step mirrors the hashing process for appending data. Use the `hashv` function from the `keccak` crate to hash the new data and the update authority, converting each to its corresponding byte representation.
- Log the New Data: As with the initial append operation, create a log object to represent the new data, and use `wrap_application_data_v1` to invoke the Noop Program via CPI. This ensures that the new uncompressed data is logged and accessible offchain.
- Replace the Existing Leaf Hash: This step is slightly different from appending new data. Here, you’ll need to invoke a CPI to the State Compression Program’s `replace_leaf` instruction. This operation will replace the existing hash at the specified leaf index with the new hash. You’ll need to provide the old hash, the new hash, and the leaf index. As usual, the Merkle tree address and tree authority bump are required as signature seeds.
When combined, the instructions for updating a hash might look like this:
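The full instruction handler is not reproduced here, but the verify-then-replace logic can be simulated in isolation. As before, this is a stand-in sketch: a standard-library hasher replaces `keccak::hashv`, and a `Vec` of leaf hashes replaces the tree; the real program would issue `verify_leaf` and `replace_leaf` CPIs instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy hash standing in for keccak::hashv (NOT the real onchain hash).
fn leaf_hash(owner: &str, data: &str) -> u64 {
    let mut h = DefaultHasher::new();
    owner.hash(&mut h);
    data.hash(&mut h);
    h.finish()
}

// Simulated update: verify the old hash at `index`, then replace it.
fn update_leaf(
    leaves: &mut [u64],
    index: usize,
    owner: &str,
    old_data: &str,
    new_data: &str,
) -> Result<u64, String> {
    // 1. Recompute the old hash from the old data + authority and check it
    //    matches the stored leaf (stands in for the verify_leaf CPI).
    let expected = leaf_hash(owner, old_data);
    if leaves[index] != expected {
        return Err("leaf verification failed: wrong data or authority".into());
    }
    // 2-4. Hash the new data and swap it in (stands in for replace_leaf).
    let new_hash = leaf_hash(owner, new_data);
    leaves[index] = new_hash;
    Ok(new_hash)
}
```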
Deleting Hashes
As of now, the State Compression Program does not have a dedicated `delete` instruction.
Instead, you can simulate deletion by updating the leaf data with a value that signals it has been "deleted."
The exact value you choose will depend on your specific use case and security requirements. For some, this may involve setting all data fields to zero, while others might prefer storing a predefined static string that marks the leaf as deleted. This approach allows you to handle deletions in a way that suits your application’s needs without compromising data integrity.
Accessing Data from a Client
We’ve covered creating, updating, and deleting data in state compression, but reading data presents its own unique challenges.
Accessing compressed data from a client can be tricky because the Merkle tree stores only data hashes, which cannot be used to recover the original data. Additionally, the uncompressed data logged to the Noop program is not retained indefinitely.
To access this data, you generally have two options:
- Work with an indexing provider to develop a custom solution tailored to your program. This allows you to write client-side code to retrieve and access the data based on how the indexer provides it.
- Create your own pseudo-indexer to store and retrieve the data, offering a lighter-weight solution.
If your project is decentralized and expects widespread interaction beyond your frontend, option 2 might not be sufficient. However, if you have control over most program interactions, this approach can work.
There’s no one-size-fits-all solution here. Two potential strategies include:
- Store raw data: One approach is to store the raw data in a database at the same time as sending it to the program. This allows you to keep a record of the data, along with the Merkle tree leaf where the data was hashed and stored.
- Create a transaction observer: Another approach is to create a server that observes the transactions your program executes. This server would fetch transactions, look up the related Noop logs, decode them, and store the data.
When writing tests in the lab, we’ll simulate both of these approaches, although instead of using a database, the data will be stored in memory for the test’s duration.
The process of setting this up can be a bit complex. For a given transaction, you’ll retrieve it from the RPC provider, extract the inner instructions related to the Noop program, and use the `deserializeApplicationDataEvent` function from the `@solana/spl-account-compression` JS package to decode the logs. Then, you’ll use Borsh to deserialize the data. Here’s an example from the messaging program to illustrate the process:
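The example itself is not reproduced here, but the first stage, pulling out the Noop program’s inner-instruction data, can be sketched with simplified (hypothetical) transaction shapes. Decoding the collected data would then use `deserializeApplicationDataEvent` and Borsh, as described above.

```typescript
// The SPL Noop program's published address.
const NOOP_PROGRAM_ID = "noopb9bkMVfRPU8AsbpTUg8AQkHtKwMYZiFUjNRtMmV";

// Simplified stand-ins for the structures an RPC provider returns.
interface InnerInstruction {
  programId: string;
  data: string; // encoded instruction data
}
interface ParsedTx {
  innerInstructions: InnerInstruction[][];
}

// Collect the raw data of every inner instruction sent to the Noop program.
function collectNoopData(tx: ParsedTx): string[] {
  return tx.innerInstructions
    .flat()
    .filter((ix) => ix.programId === NOOP_PROGRAM_ID)
    .map((ix) => ix.data);
}
```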
Conclusion
Implementing generalized state compression may be challenging, but it is entirely achievable using the available tools. As the ecosystem evolves, these tools and programs will continue to improve, making the process more streamlined. If you discover solutions that enhance your development experience, please don’t hesitate to share them with the community!
Remember to write comprehensive tests for your state compression implementation. This ensures your program behaves correctly and helps catch potential issues early in the development process.
Lab: Building a Note-Taking App with Generalized State Compression
In this lab, we’ll walk through the process of developing an Anchor program that uses custom state compression to power a basic note-taking app. This will give you hands-on experience in working with compressed data and help reinforce key concepts around state compression on Solana.
1. Set up the Project
Start by initializing an Anchor program:
Next, we’ll add the `spl-account-compression` crate with the `cpi` feature enabled. To do this, update the `Cargo.toml` file located at `programs/compressed-notes` by adding the following dependency:
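The added entry might look like this (the version number is illustrative; use whichever release matches your Anchor toolchain):

```toml
[dependencies]
# ...existing dependencies...
spl-account-compression = { version = "0.3.0", features = ["cpi"] }
```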
We’ll be running tests locally, but we’ll need both the State Compression Program and the Noop Program from Mainnet to do so. To make sure these programs are available on our local cluster, we need to include them in the `Anchor.toml` file located in the root directory. In `Anchor.toml`, update the programs section with the following entries:
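A sketch of those entries is below. The addresses are the published program IDs for the SPL Account Compression and Noop programs (double-check them against the current SPL release), and the `.so` file names assume you have dumped the two program binaries locally:

```toml
[[test.genesis]]
address = "cmtDvXumGCrqC1Age74AVPhSRVXJMd8PJS91L8KbNCK"
program = "spl_account_compression.so"

[[test.genesis]]
address = "noopb9bkMVfRPU8AsbpTUg8AQkHtKwMYZiFUjNRtMmV"
program = "spl_noop.so"
```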
Finally, let’s set up the `lib.rs` file for the remainder of the demo. Start by removing the `initialize` instruction and the `Initialize` accounts struct. Next, add the necessary imports as indicated in the code snippet, making sure to include your program ID.

For the remainder of this demo, we’ll be making updates directly in the `lib.rs` file. This approach simplifies the explanations. You can modify the structure as needed.
It’s a good idea to build your project now to confirm that your environment is set up correctly and to reduce build times in the future.
2. Define the `Note` Schema

Next, we’ll define the structure of a note within our program. Each note should have the following attributes:

- `leaf_node` - a 32-byte array representing the hash stored on the leaf node.
- `owner` - the public key of the note’s owner.
- `note` - a string containing the text of the note.
In a traditional Anchor program, a note would typically be represented by a `Note` struct using the `account` macro. However, because we’re using state compression, we use `NoteLog`, a struct with the `AnchorSerialize` macro applied.
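The struct itself is not shown here; a stand-in sketch with illustrative types could look like this. In the real program, `owner` would be a `Pubkey` and the struct would derive `AnchorSerialize`; `[u8; 32]` is used here so the sketch compiles on its own.

```rust
// Hypothetical sketch of the NoteLog type (not the lesson's exact code).
#[derive(Clone, Debug, PartialEq)]
pub struct NoteLog {
    pub leaf_node: [u8; 32], // the hash stored on the Merkle tree leaf
    pub owner: [u8; 32],     // stand-in for the owner's Pubkey
    pub note: String,        // the full note text
}

impl NoteLog {
    pub fn new(leaf_node: [u8; 32], owner: [u8; 32], note: String) -> Self {
        NoteLog { leaf_node, owner, note }
    }
}
```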
3. Define Account Constraints
All our instruction handlers will use the same account constraints:

- `owner` - The creator and owner of the note, who must sign the transaction.
- `tree_authority` - The authority for the Merkle tree, used for signing compression-related CPIs.
- `merkle_tree` - The address of the Merkle tree where note hashes are stored; this account is unchecked because it is validated by the State Compression Program.
- `log_wrapper` - The address of the Noop Program.
- `compression_program` - The address of the State Compression Program.
4. Create the `create_note_tree` Instruction Handler

Next, we’ll create the `create_note_tree` instruction handler to initialize the already allocated Merkle tree account.

To implement this, you’ll need to build a CPI to invoke the `init_empty_merkle_tree` instruction from the State Compression Program. The `NoteAccounts` struct will provide the necessary accounts, but you’ll also need to include two additional arguments:

- `max_depth` - Specifies the maximum depth of the Merkle tree, indicating the longest path from any leaf to the root.
- `max_buffer_size` - Defines the maximum buffer size for the Merkle tree, which determines the space allocated for recording tree updates. This buffer is crucial for supporting concurrent updates within the same block.
Make sure that when setting up your CPI, you include both the Merkle tree address and the tree authority bump in the signer seeds.
5. Create the `append_note` Instruction Handler

Let’s create the `append_note` instruction handler. This will compress a raw note into a hash and store it on the Merkle tree, while also logging the full note to the Noop program so it remains accessible through the ledger.
Here’s how to accomplish this:
- Hash the Data: Use the `hashv` function from the `keccak` crate to compute a hash of the note and the owner’s public key. Both should be converted to their byte representations. It’s essential to hash the owner along with the note to facilitate ownership verification during updates.
- Log the Data: Create a `NoteLog` instance with the hash from step 1, the owner’s public key, and the note as a `String`. Then, use `wrap_application_data_v1` to issue a CPI to the Noop program with this `NoteLog` instance. This ensures the complete note (not just the hash) is available to clients, similar to how indexers manage cNFTs. You might also develop an observing client to simulate indexer functionality specific to your application.
- Append to the Merkle Tree: Build and issue a CPI to the State Compression Program’s `append` instruction. This will add the hash from step 1 to the next available leaf on your Merkle tree. Ensure that the Merkle tree address and the tree authority bump are included as signature seeds.
6. Create the `update_note` Instruction Handler

The final instruction handler we’ll implement is `update_note`, which will replace an existing leaf with a new hash representing the updated note data.
To perform this update, you’ll need the following parameters:
- Index: The index of the leaf to be updated.
- Root: The root hash of the Merkle tree.
- Old Note: The string representation of the note that is being updated.
- New Note: The string representation of the updated note.
The process for this instruction handler is similar to `append_note`, with some additional steps:
- Verify Ownership: Before updating, prove that the `owner` executing this instruction is the rightful owner of the leaf at the specified index. Since the leaf data is compressed as a hash, you can’t directly compare the `owner`’s public key. Instead, compute the previous hash using the old note data and the `owner` from the account validation struct. Then, use this computed hash to build and issue a CPI to the State Compression Program’s `verify_leaf` instruction.
- Hash the New Data: Hash the new note and the owner’s public key using the `hashv` function from the `keccak` crate, converting each to its byte representation.
- Log the New Data: Create a `NoteLog` instance with the new hash from step 2, the owner’s public key, and the new note. Call `wrap_application_data_v1` to issue a CPI to the Noop program with this `NoteLog` instance, ensuring the updated note data is available to clients.
- Replace the Leaf: Build and issue a CPI to the State Compression Program’s `replace_leaf` instruction. This will replace the old hash with the new hash at the specified leaf index. Ensure the Merkle tree address and the tree authority bump are included as signature seeds.
7. Client Test Setup
To ensure our program functions correctly, we’ll set up and write some tests. Here’s what you need to do for the setup:
- Install Dependencies: We’ll be using the `@solana/spl-account-compression` package for our tests. Install it using the following command:
- Create Utility File: To simplify testing, we’ve provided a utility file. Create a `utils.ts` file in the `tests` directory and add the provided contents. We’ll go over the details of this file shortly.
The `utils.ts` file contains three key components:

- `NoteLog` Class: This class represents the note log that we’ll extract from the Noop program logs. It also includes the Borsh schema, named `NoteLogBorshSchema`, which is used for deserialization.
- `getHash` Function: This function generates a hash from the note and its owner, allowing us to compare it against the data in the Merkle tree.
- `getNoteLog` Function: This function searches through the transaction logs to locate the Noop program logs, then deserializes and retrieves the corresponding `NoteLog`.
8. Write Client Tests
With our packages and utility file set up, we’re ready to dive into writing the tests. We will create four tests for our program:
- Create Note Tree: This test will initialize the Merkle tree for storing note hashes.
- Add Note: This test will invoke the `append_note` instruction to add a note to the tree.
- Add Max-Size Note: This test will also use the `append_note` instruction, but with a note that reaches the maximum allowable size of 1232 bytes in a single transaction.
- Update First Note: This test will use the `update_note` instruction to modify the first note that was added.
The first test is mainly for setup purposes. For the remaining three tests, we will check that the note hash in the Merkle tree matches the expected value based on the note content and the signer.
We’ll start by setting up our imports. This includes a variety of components from Anchor, `@solana/web3.js`, `@solana/spl-account-compression`, and our own utility functions.
Next, we’ll set up the state variables needed for our tests. This setup will include:
- Default Anchor Setup: Configure the basic environment for Anchor testing.
- Merkle Tree Keypair: Generate a keypair for the Merkle tree.
- Tree Authority: Create a keypair for the authority of the Merkle tree.
- Notes: Define some sample notes to use in the tests.
Now, let’s dive into the `Create Note Tree` test. This test will accomplish two key tasks:
- Allocate a New Merkle Tree Account: Create a new account for the Merkle tree, specifying a max depth of 3, a max buffer size of 8, and a canopy depth of 0.
- Initialize the Account: Use our program’s `createNoteTree` instruction to set up the newly allocated Merkle tree account.
That’s a wrap. Congratulations! Run `anchor test`, and you should see all four tests passing.
If you encounter any issues, don’t hesitate to revisit the demo or check out the complete solution code in the Compressed Notes repository.
Challenge
Now that you’ve got the hang of state compression, it’s time to add a new feature to the Compressed Notes program. Your task is to implement an instruction that allows users to delete an existing note. Keep in mind that you can’t physically remove a leaf from the Merkle tree, so you’ll need to come up with a method to signify that a note has been deleted.
Good luck, and happy coding!
For a straightforward example of how to implement a delete function, check out the `main` branch on GitHub.
Push your code to GitHub and let us know what you think of this lesson!