Enabling Algo Research on Blockchain Data

Published: Dec 2, 2021

Introduction

Strategy development at HRT is a complex process, and in some cases involves exploring massive data sets to find interesting patterns and generate ideas. In the world of cryptocurrency, an abundance of information is readily available on various blockchains, but it’s not easy to perform analytical research on this data. 

Iwona Kotlarska, a Summer 2021 HRT Core intern, helped HRT tackle this problem for Ethereum data by structuring it into a format that could be queried using a distributed query engine. This engine supports conventional SQL-like queries, which provide an interface suited to analytical research.

Our intern experience offers direct exposure to real-world engineering problems, which often come with practical challenges that require creativity and persistence to tackle, and this project was no exception. Iwona exemplified those qualities on her road to a successful and useful project – here's her story.

– Kevin Rawlings

We are sitting on the terrace on the 76th floor eating lunch, some interns and some fulltimers, when one of the fulltimers asks us what the selling point of our intern project is. When my turn to answer comes, I ask him whether he is an Algo Developer (a role at HRT which often performs complex analytical research). He confirms. “You’re gonna be able to do research on Ethereum blockchain data,” I tell him.

– Iwona Kotlarska

Introduction

Ethereum is one of the most popular cryptocurrencies out there, second only to Bitcoin. One of its most distinctive features (when compared to Bitcoin) is that it offers the ability to execute arbitrary code that is stored and verified in the blockchain, using a mechanism called "smart contracts." Ethereum has the concept of accounts, which can send, receive, and hold Ethereum. When accounts exchange Ethereum, this is often referred to as a "transaction" or a "value transfer." Smart contracts are like automated user accounts that, upon interaction, execute their stored code. This code can result in additional transactions, with either other user accounts or other smart contract accounts. My initial goal was to record all of the transaction activity that occurs in the public Ethereum network in such a way that it would be easy for HRT's researchers to perform analysis on the transaction data.

There is already some public infrastructure available that enables simple queries on transaction data, such as "what are the transactions to a given address." Somewhat more complex queries are also supported, such as "how many transactions were there in a given time period." However, when we want statistics about those transactions, the information simply isn't there. We need something better.

Simple Things First

For my purposes, I decided to put the data in a format that would support SQL queries, which are familiar to most researchers, and fortunately there was a pre-existing solution for this: HRT already has a Trino cluster up and running. Trino is a distributed query engine that can execute queries across physical partitions of data (typically on disk) in order to deliver results with lower latency. After some discussion with my teammates, I decided to partition the data by date – it seemed like a good compromise between "not having a ton of small partitions" and "not having the data in one massive blob."

So for starters, I decided to log Ethereum transaction data, which can be retrieved from an Ethereum node, in a simple table format. HRT runs its own OpenEthereum node, so I was able to query it using web3. With simple code like:

```python
from web3 import Web3

w3 = Web3(Web3.WebsocketProvider(OPENETH_NODE_ADDRESS))
block = w3.eth.get_block(BLOCK_NUMBER, full_transactions=True)
```

I could get all of the transactions in the block `BLOCK_NUMBER` in a Python AttributeDict, where the `transactions` field would just be a list of AttributeDicts:

```python
AttributeDict({'blockHash': HexBytes('0x103c76b8b8b378dccdf57b4601f5d2db3105cf4f499e612c1547ee7a2e8b5d1c'),
 'blockNumber': 9459694,
 'from': '0xAB83D182f3485cf1D6ccdd34C7CFEf95b4C08da4',
 'gas': 60000,
 'gasPrice': 70000000000,
 'hash': HexBytes('0xe082ca768add447ff81962e35788a0f2c048f7d1765829d7dd0808163e58a4a2'),
 'input': '0xa9059cbb000000000000000000000000a6df27fc9795e5fc2086994a898db6871b1379c20000000000000000000000000000000000000000000000000000000094f0a980',
 'nonce': 8081,
 'to': '0xdAC17F958D2ee523a2206206994597C13D831ec7',
 'transactionIndex': 0,
 'value': 0,
 'type': '0x0',
 'v': 37,
 'r': HexBytes('0x2e19f9c83b05e60cdb49bcc48de2a811ab670c8e741599ae3de754976baa3946'),
 's': HexBytes('0x405aee2f3d2f1c6d52731e7da5420137ceb6d6ef7161695bee606dcb64157a64')})
```

The table to put that data in was simple enough:

```
              Column       Type  Extra Comment
0        blocknumber     bigint
1              from_    varchar
2                gas     bigint
3           gasprice     double
4               hash  varbinary
5              input    varchar
6              nonce     bigint
7          timestamp     bigint
8                 to    varchar
9   transactionindex     bigint
10             value     double
```
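
As a rough illustration, a block's transactions could be flattened into rows matching this schema with something like the sketch below (`OPENETH_NODE_ADDRESS` as before; the date partition key, derived from the block timestamp, is part of the partitioning scheme described earlier even though it doesn't appear in the schema printout):

```python
from datetime import datetime, timezone

from web3 import Web3

w3 = Web3(Web3.WebsocketProvider(OPENETH_NODE_ADDRESS))

def block_to_rows(block_num):
    """Flatten one block's transactions into dicts matching the table above."""
    block = w3.eth.get_block(block_num, full_transactions=True)
    # Partition key: the UTC date derived from the block timestamp.
    date = datetime.fromtimestamp(block["timestamp"], tz=timezone.utc).date().isoformat()
    rows = []
    for tx in block["transactions"]:
        rows.append({
            "blocknumber": tx["blockNumber"],
            "from_": tx["from"],                # "from" is a reserved word, hence the underscore
            "gas": tx["gas"],
            "gasprice": float(tx["gasPrice"]),  # stored as double: uint256 values can exceed bigint
            "hash": bytes(tx["hash"]),          # varbinary column
            "input": tx["input"],
            "nonce": tx["nonce"],
            "timestamp": block["timestamp"],
            "to": tx["to"],
            "transactionindex": tx["transactionIndex"],
            "value": float(tx["value"]),        # same caveat as gasprice
            "date": date,
        })
    return rows
```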

This enables queries like "what is the largest transaction in a given time period" to be answered quite easily. But I also wanted to record all transactions interacting with a particular contract, which is difficult due to a concept Ethereum supports, typically referred to as "internal transactions."
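
For instance, a "largest transaction in a given time period" query might look something like this through the trino Python client (the host, catalog, schema, and table names below are placeholders, not HRT's actual setup):

```python
import trino

# Placeholder connection details and table name.
conn = trino.dbapi.connect(host="trino.example.internal", port=8080,
                           user="researcher", catalog="hive", schema="ethereum")
cur = conn.cursor()
cur.execute("""
    SELECT hash, value
    FROM transactions
    WHERE "date" BETWEEN '2021-07-01' AND '2021-07-07'
    ORDER BY value DESC
    LIMIT 1
""")
print(cur.fetchone())
```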

Internal Transactions

“Internal transactions” refers to value transfers that occur when one smart contract transacts with another. For example, a user could initiate a transaction with smart contract A, which in turn could result in code that executes a transaction with smart contract B. Only the “top level” transactions can be read from transaction metadata. If there are subsequent contract calls, such as smart contract A transacting with smart contract B in the above example, the details of each call are not saved by an Ethereum node. This doesn’t compromise how the Ethereum protocol works, but it does mean the data I had recorded thus far for analysis was incomplete.

After searching the internet and the web3 docs, it turned out that there is no standard API to request a list of internal transactions, because they are not actually stored on the node or in the blockchain. However, I encountered some StackOverflow questions that pointed to Geth's debug API, and one method available there in particular – `debug_traceTransaction`. It is not a standard method in web3, but you can still make the request, just using a slightly lower-level call:

```python
from web3 import Web3

w3_provider = Web3.WebsocketProvider(OPENETH_NODE_ADDRESS)
trace = w3_provider.make_request(
    "debug_traceTransaction",
    [TRANSACTION_HASH, {}],
)
```

This debug API can also return a trace of what the Ethereum Virtual Machine was doing during execution of the code associated with that transaction, which is effectively the most detailed trace possible. Sounds great, except that as a Geth-specific debug API it is not available via OpenEthereum, so I had to run my own Geth node. Running a Geth node is simple enough if you have sufficient resources – download Go, download and compile Geth, and then run it. It automatically starts downloading the blockchain data, and you can query it up to the point where it has synced. That meant I had to test on very old data to start with, but it should sync in roughly 11 hours using Geth's default fast sync. I then ran into another problem:

```python
{'jsonrpc': '2.0', 'error': {'code': -32601, 'message': 'Method not found'}, 'id': 6}
```

It turns out that a node synced with Geth's fast sync cannot produce full traces for older blocks. So that meant doing a full sync, which takes significantly longer than a fast sync and uses far more resources. Once the node had caught up, I was able to begin processing the traces.

The traces Geth provides are in the form of a list of full Ethereum Virtual Machine state snapshots at each consecutive step of execution. One full state can look like:

```python
{'pc': 295, 'op': 'CALLDATALOAD', 'gas': 297385, 'gasCost': 3, 'depth': 1, 'stack': ['0x230a07c', '0x12c', '0x4'], 'memory': ['0000000000000000000000000000000000000000000000000000000000000000', '0000000000000000000000000000000000000000000000000000000000000000', '0000000000000000000000000000000000000000000000000000000000000060']}
```

And the next state is:

```python
{'pc': 296, 'op': 'PUSH2', 'gas': 297382, 'gasCost': 3, 'depth': 1, 'stack': ['0x230a07c', '0x12c', '0x7c7ef64906a730c191346befa970aeadb147dce928c2e82806b20d2f14a7e4f'], 'memory': ['0000000000000000000000000000000000000000000000000000000000000000', '0000000000000000000000000000000000000000000000000000000000000000', '0000000000000000000000000000000000000000000000000000000000000060']}
```

To decipher what is going on here, I needed to take a look at a few of the fields. The EVM is a stack machine with additional memory, so the `stack` and `memory` fields are separate. The `pc` field is the program counter, and `op` is the opcode. `depth` indicates how many nested calls have happened – this is the perfect mechanism for tracing all of the possible entries into and returns from functions (opcodes alone are not sufficient for the latter, because any opcode can return if an error was encountered). To trace internal transactions, I had to take a closer look at the opcodes and dig into the ones that are various forms of function calls. There are a bunch of them – `CALL`, `CALLCODE`, `DELEGATECALL`, `STATICCALL`, `CREATE` and `CREATE2`. The last two actually create new contracts, and also start executing them. A small but noteworthy difference is that `CREATE2` is deterministic, so it is possible to deploy a contract under the same address on a test network as was used on the public network, which is a nice piece of engineering worth appreciating.
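
To make that concrete, here is a simplified sketch of how the call-related opcodes and the `depth` field can be combined to spot nested calls (where `struct_logs` is the list of snapshots shown above; the real handling of edge cases is more involved than this):

```python
CALL_OPCODES = {"CALL", "CALLCODE", "DELEGATECALL", "STATICCALL", "CREATE", "CREATE2"}

def find_internal_calls(struct_logs):
    """Yield (step_index, opcode, depth, entered_frame) for every call-like step.

    A call at depth d only enters a new frame if the next snapshot is at
    depth d + 1; a call to a plain account (a simple value transfer) or to a
    precompile executes no EVM code, so the depth stays the same.
    """
    for i, step in enumerate(struct_logs):
        if step["op"] in CALL_OPCODES:
            next_depth = struct_logs[i + 1]["depth"] if i + 1 < len(struct_logs) else None
            entered = next_depth == step["depth"] + 1
            yield i, step["op"], step["depth"], entered
```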

Since the EVM is a stack machine, the arguments of those calls are on the stack and should be read from there. According to the specification, the stack holds an offset and a length pointing at the call's input data in memory, which also needs to be read. The EVM is a big-endian machine, unlike x86. Using all of this, I was able to get the actual input of each call (a sketch of this decoding step follows the table below), which produces a table that looks like:

```
           Column     Type          Extra Comment
0        calltype  varchar
1           depth   bigint
2          failed  boolean
3           from_  varchar
4             gas   double
5        gasprice   double
6           input  varchar
7       parenttxn  varchar
8   returnedvalue  varchar
9              to  varchar
10          value   double
11           date  varchar  partition key
```
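
Here is a minimal sketch of that argument-decoding step for a single call-like snapshot, assuming (as I observed in the traces) that Geth's struct logger lists the stack bottom-to-top and the memory as 32-byte hex words; `CREATE` and `CREATE2` lay their arguments out differently and are left out:

```python
def read_memory(memory_words, offset, length):
    """Read `length` bytes at `offset` from the memory snapshot (a list of 32-byte hex words)."""
    data = bytes.fromhex("".join(memory_words))
    return data[offset:offset + length]

def call_args(step):
    """Extract the callee address and input bytes at a CALL-family step."""
    # The stack is listed bottom-to-top, so stack[-1] is the top of the stack.
    stack = [int(word, 16) for word in step["stack"]]
    to = stack[-2]  # the callee address sits second from the top for all four call types
    if step["op"] in ("CALL", "CALLCODE"):
        # Top of stack: gas, to, value, args offset, args length, ...
        args_offset, args_length = stack[-4], stack[-5]
    else:  # DELEGATECALL and STATICCALL carry no value argument
        args_offset, args_length = stack[-3], stack[-4]
    return hex(to), read_memory(step["memory"], args_offset, args_length)
```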

Now I could trace all the transactions that interact with a particular address, and also obtain some statistics about those transactions.

Note: in hindsight, Geth's JavaScript tracer feature is probably a better way to extract this information.

After processing the traces, it was important to test the new features properly. I might have missed some of the important opcodes (I actually did, but looking at the `depth` field helped to catch that), or I might have made some wrong assumptions. So, to further convince myself that the code I'd written was correct, I wanted to run it on a lot of data. That meant leaving it overnight – web3 calls are not the fastest thing out there.

I expected to come back to some logs about wrong assumptions, or maybe to some other exceptions in my Python code. Instead, I found that my Geth node had been killed by the OS. Why had the Geth process been killed? A quick look at `dmesg` provided the answer:

```
[5716791.967165] Out of memory: Kill process 20977 (geth) score 795 or sacrifice child
[5716791.969437] Killed process 20977 (geth) total-vm:83614416kB, anon-rss:77619548kB, file-rss:0kB, shmem-rss:0kB
[5716796.074808] oom_reaper: reaped process 20977 (geth), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

Why does Geth run out of memory?

To answer that question, I needed to be able to consistently replicate the behaviour. It turned out that spamming the node with enough `debug_traceTransaction` requests was sufficient to kill it. Geth is written in Go, which has a garbage collector, so a memory leak in the implementation seemed unlikely. Slowing down the rate at which requests were made changed how quickly the node died, but it still died. So it was time to look for an alternative.

OpenEthereum doesn't support `debug_traceTransaction`, but maybe it has something similar? It turns out the answer is "yes" – its tracing API includes `trace_replayTransaction`, which should provide the same core functionality I required of `debug_traceTransaction`. It doesn't provide full EVM state snapshots, but it looked like the data from the OpenEthereum node would be sufficient.
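
The request looks much like the Geth one, reusing the same provider-level call as before (treat the response layout described in the comment as a sketch rather than a precise spec):

```python
from web3 import Web3

w3_provider = Web3.WebsocketProvider(OPENETH_NODE_ADDRESS)
replay = w3_provider.make_request(
    "trace_replayTransaction",
    [TRANSACTION_HASH, ["trace"]],  # request only the call trace, not vmTrace/stateDiff
)
# replay["result"]["trace"] is a list of call frames; each one records the call
# type, from/to addresses, value and input, plus a traceAddress that locates
# it within the call tree.
```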

However, it turned out that the Geth node was still needed for one very specific thing: checking whether a transaction from before the Byzantium hard fork succeeded (in theory, one could check whether the transaction used all of its available gas, but a successful transaction can supposedly use all of its gas as well, so this method could result in false positives). So that's how I ended up with a setup that requires both types of nodes, OpenEthereum and Geth.
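
A sketch of that check, relying on the top-level `failed` flag that Geth's struct logger includes in its `debug_traceTransaction` result:

```python
def tx_succeeded(geth_provider, tx_hash):
    """Pre-Byzantium receipts carry no status field, so replay the transaction
    on the Geth node and read the trace's top-level `failed` flag instead."""
    trace = geth_provider.make_request("debug_traceTransaction", [tx_hash, {}])
    return not trace["result"]["failed"]
```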

In Search of Even More Data

Now that I was able to extract both public and internal transaction data, I began to search for more data that could be mined from running transactions on my node with tracing enabled. It turns out there is more interesting data available: the `LOG*` opcodes. Those opcodes allow contracts to emit *events*, which contract developers can use to keep track of what is going on in a contract call. Decentralized exchanges like Uniswap or Sushiswap use these events during their operation, and it is possible to glean rich information about an exchange's activity by tracing the `LOG*` opcodes. As an example, I was able to trace some `LOG*` opcodes from my node that represent the signature of a Solidity event that looked like this:

```solidity
event Swap(
    address indexed sender,
    uint amount0In,
    uint amount1In,
    uint amount0Out,
    uint amount1Out,
    address indexed to
);
```

Given the source code of a contract (and the aforementioned DEXes are open source), I could fully decipher, for example, what swaps were performed. This data can be extremely useful for trading.
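
As a rough sketch of what that deciphering can look like for the Swap event above (assuming the event's topics and data are available as raw bytes, e.g. reconstructed from a `LOG*` opcode's stack and memory, or taken from a receipt):

```python
from web3 import Web3

# Topic 0 of a log is the keccak-256 hash of the event's canonical signature.
SWAP_TOPIC = Web3.keccak(text="Swap(address,uint256,uint256,uint256,uint256,address)")

def decode_swap(topics, data):
    """Decode a Uniswap-V2-style Swap event.

    The indexed fields (sender, to) arrive as 32-byte topics; the four
    unindexed amounts are packed into `data` as big-endian 32-byte words.
    """
    assert topics[0] == SWAP_TOPIC
    sender = "0x" + topics[1][-20:].hex()
    to = "0x" + topics[2][-20:].hex()
    amounts = [int.from_bytes(data[i:i + 32], "big") for i in range(0, 128, 32)]
    return {"sender": sender, "to": to,
            "amount0In": amounts[0], "amount1In": amounts[1],
            "amount0Out": amounts[2], "amount1Out": amounts[3]}
```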

And there is more. Before transactions are mined, they sit in something called the mempool, and it just so happened that another intern was making sure mempool data was available to other HRT applications. As a result, it was possible to also include that data in Trino in a fairly straightforward table:

```
                  Column            Type          Extra Comment
0                  nonce          bigint
1                    gas          bigint
2               gasprice          double
3                  value          double
4                   from         varchar
5                     to         varchar
6                   hash         varchar
7                chainid          bigint
8              sourceloc         varchar
9                   data         varchar
10             timestamp          double
11  maxpriorityfeepergas          double
12          maxfeepergas          double
13            accesslist  array(varchar)
14                  date         varchar  partition key
```

As you can see, this table has a bit more information than the original one for confirmed transactions. This is because of the London hard fork, which happened around the time I was working on this project. So, as a last small touch, I was able to add those fields to the information about regular, mined transactions as well.

During my internship I was able to work on a very open-ended project (What tables should there be? How should the data be split? How could its usefulness be validated?) that required interacting with a myriad of technologies: OpenEthereum, Trino, Geth, etc. It also required me to understand what is currently happening with Ethereum, such as the London hard fork, and involved very interesting low-level work with the Ethereum Virtual Machine. Finally, this was real work that had a lot of potential impact on HRT's future trading efforts.

Conclusion

Iwona's experience is a great end-to-end example of working through an open-ended engineering problem. It required navigating the intersection of multiple technologies, diving deep into system internals, learning the history of the systems involved, and solving practical problems.

We are now in the process of fully productionizing and expanding the database Iwona created, and it has proven immediately useful to our researchers! 

– Kevin Rawlings
