In the previous post (or re-post) we discussed the operations of the Gluster Tiering feature. In this post we will discuss how we capture the heat (special metadata) of the files/data-objects in Gluster, so that data maintenance activities (which are expensive on CPU, storage and network) like tier migrations (promotions/demotions) can be done intelligently. We will also discuss possible improvements to the existing system so that heat can be captured efficiently and accurately. But before diving into this, let's look at the existing system.
The current architecture of Gluster tiering is as follows. We have a client-side translator called the tier xlator (for more details on Gluster translators, or xlators, click here). This translator is responsible for multiplexing the IO to the hot or cold tier. And as a thin application over this translator, we have the tier migration logic, which functions as the Gluster tier migrator and does promotions and demotions. (NOTE: In Gluster, the various data maintenance daemons, like the DHT rebalancer, AFR self-heal and the tiering migrator, are modified clients that are loaded with the necessary xlator graph depth.) The tiering migrator, which runs on each file serving node, queries each local Gluster brick's heat store via libgfdb and promotes or demotes files appropriately. Each file serving process (Gluster brick) has a new translator called the Change Time Recorder (CTR) that intercepts the IO and records the required extra metadata (in our case, heat) in the heat store of the respective brick, via libgfdb.
Now, we are not going to discuss the tier xlator or the tier migrator in detail in this post. Rather, we will discuss the CTR xlator, libgfdb and the heat data store.
Data tiering is a kind of data maintenance task, and thus takes a toll on CPU, storage and network performance, so it should be done intelligently. Precision in selecting the right candidates for migration (promotion/demotion) makes data tiering more effective and efficient, as you will not be migrating undeserving files to the wrong tier. Also, data migration is an activity that is done conservatively, i.e. either less frequently with a large number/volume of files or frequently with a small number/volume of files, for the very same reason stated above.
Also, while capturing the heat of a file we need to take advantage of the distributed nature of the file system (Gluster), so that we have no single point of failure and proper load balancing. We don't want the admin or user of the file system to be aware of such a heat data store, so that there is no maintenance (disaster recovery, service management etc.) overhead.
So taking these factors (and some more) into consideration, the requirements for the heat store would be as follows:
- Should be well distributed: no single point of failure, with load balancing
- No separate process/service for the Admin to manage
- Should store various heat metadata of the files, i.e. more versatility in heat configurations
- Should provide good querying capabilities to have precise query results
- Should record the heat-metadata quickly without a performance hit on IO.
When we explored options for such a store, the one that fit the bill was SQLite3, as:
- It's a library and not a database service, so less hassle for the admin
- It's a library, so it can be embedded in Gluster bricks, which provides distribution
- Has good querying capabilities with SQL
- Gives precise query results
- Has decent write/record optimization
- Can store various kinds of file heat metadata.
Even though SQLite3 looked attractive, we wanted the flexibility to load and try other data stores, so we introduced a library called libgfdb. The main purpose of libgfdb is to provide a layer of abstraction, so that the library user is agnostic to the data store used below and uses a standard library API to insert, update and query the heat metadata of files in the data store.
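To make that concrete, here is a minimal sketch in C of what such a plugin-style abstraction could look like. All names and signatures below are illustrative assumptions, not the actual libgfdb API:

```c
/* Illustrative sketch of a libgfdb-style abstraction; every name and
 * signature here is hypothetical, not the real libgfdb API. */
#include <stdint.h>

typedef struct gfdb_conn gfdb_conn_t;   /* opaque handle to one brick's heat store */

typedef struct gfdb_ops {
    gfdb_conn_t *(*connect)(const char *db_path);
    int (*insert_record)(gfdb_conn_t *conn, const uint8_t gfid[16]);
    int (*record_heat)(gfdb_conn_t *conn, const uint8_t gfid[16],
                       int is_write);
    /* e.g. "files untouched since <time>" yields demotion candidates */
    int (*query_by_time)(gfdb_conn_t *conn, uint64_t since_epoch,
                         int (*cb)(const uint8_t gfid[16], void *data),
                         void *data);
    void (*disconnect)(gfdb_conn_t *conn);
} gfdb_ops_t;

/* A data-store plugin (sqlite3 today, something else tomorrow) fills in
 * one gfdb_ops_t; CTR and the tier migrator only ever call through it,
 * so they stay agnostic to the store underneath. */
```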
Therefore the overall architecture of the solution looks like this. We have a new translator called the Change Time Recorder (CTR) that is introduced into the brick stack. The sole purpose of this translator is to do inserts/updates to the heat data store, i.e. the sqlite3 db, via libgfdb. The tier migration daemon that runs on every file serving node queries the local brick's heat data store via libgfdb and does migrations accordingly.
Now let's look at the write performance optimizations that SQLite3 provides:
PRAGMA page_size: Align the page size to the OS page size
PRAGMA cache_size: Increase the cache size
PRAGMA journal_mode: Change to write-ahead logging (WAL)
PRAGMA wal_autocheckpoint: Checkpoint less often
PRAGMA synchronous: Set to OFF, i.e. no synchronous writes
PRAGMA auto_vacuum: Set to NONE (garbage collection then needs to be done periodically)
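As a minimal sketch, this is how those pragmas could be applied through the SQLite3 C API (the values are illustrative, not the exact ones Gluster uses):

```c
/* Minimal sketch: applying the write-oriented pragmas via the SQLite3
 * C API. Values are illustrative, not the exact ones Gluster uses. */
#include <sqlite3.h>
#include <stdio.h>

static int
apply_write_pragmas(sqlite3 *db)
{
    const char *pragmas =
        "PRAGMA page_size = 4096;"           /* align with the OS page size  */
        "PRAGMA cache_size = 12500;"         /* larger page cache            */
        "PRAGMA journal_mode = WAL;"         /* write-ahead logging          */
        "PRAGMA wal_autocheckpoint = 25000;" /* checkpoint less often        */
        "PRAGMA synchronous = OFF;"          /* no fsync on every commit     */
        "PRAGMA auto_vacuum = NONE;";        /* vacuum periodically instead  */
    char *err = NULL;

    if (sqlite3_exec(db, pragmas, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "pragma failed: %s\n", err);
        sqlite3_free(err);
        return -1;
    }
    return 0;
}
```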
For more information on sqlite3 settings click here.
This is how inserts/updates go into the SQLite3 database.
Looking at these settings, you may wonder whether we should be worrying about crash consistency? Actually no, because we have introduced a CTR lookup heal feature which heals the heat data store during lookups and IO. Thus the heat data store will be eventually consistent, and for the tiering case eventual consistency is just fine!
Even though we have optimized SQLite3, our heat data store, we still see performance issues due to the following:
- Indexing in WAL/DB: even though writes to SQLite3 go in a sequential manner, there is a consistency check on the index/primary keys. So for a large database (even though each database holds only a part of the files in the whole Gluster namespace), the indexing takes a toll on performance.
- Synchronous recording of heat: the heat of a file is recorded synchronously with the IO, slowing down the IO performance of bricks due to latency in the heat data store. Recording should be decoupled from the IO path.
- SQL creates are expensive: creates in SQLite3 are transactions, i.e. a create needs to be ACID in nature. This takes a toll on performance. A huge number of creates shows much better performance when they are batched together in a single transaction. Hence the need for batch processing of creates, which is absent today.
- Read-Modify-Update is expensive: due to WAL (write-ahead logging), the read performance of SQLite3 is not great. This is fine for queries that are executed periodically, as the time taken to query is much smaller than the time taken to migrate the files (i.e. the data maintenance done on the query result dwarfs the query time). But it has a very bad impact when capturing heat requires read-modify-updates, for example counting the writes/IOPS on a file over a specified period of time. Such updates should be made less often, i.e. instead of 10 incremental updates we can consolidate them into a single +10, as in the sketch below.
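As a hedged illustration of that consolidation, one parameterized UPDATE can replace ten incremental ones. The table and column names here are made up for the example, not the actual libgfdb schema:

```c
/* Illustrative consolidation of read-modify-updates: issue one "+10"
 * instead of ten "+1"s. Table/column names are hypothetical. */
#include <sqlite3.h>

static int
flush_write_count(sqlite3 *db, const char *gfid_str, int delta)
{
    sqlite3_stmt *stmt = NULL;
    int ret = sqlite3_prepare_v2(db,
        "UPDATE gf_file_heat SET write_count = write_count + ?1 "
        "WHERE gfid = ?2;", -1, &stmt, NULL);
    if (ret != SQLITE_OK)
        return ret;
    sqlite3_bind_int(stmt, 1, delta);                      /* e.g. +10 */
    sqlite3_bind_text(stmt, 2, gfid_str, -1, SQLITE_STATIC);
    ret = sqlite3_step(stmt);            /* one write instead of ten  */
    sqlite3_finalize(stmt);
    return (ret == SQLITE_DONE) ? SQLITE_OK : ret;
}
```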
All of the above raises the need for a caching solution in libgfdb which does the following:
- Removes the indexing of the data store from the IO heat recording path
- Batch-processes inserts/updates
- Consolidates incremental updates
- Records heat asynchronously
The solution would be IMeTaL, i.e. In Memory Transaction Logs.
In Memory Transaction Logs, or simply imetal, is an in-memory circular log buffer. The idea is to take the inserts/updates into imetal and, when logical partitions of imetal called segments fill up, batch process those inserts/updates asynchronously, in parallel with the incoming inserts/updates to other segments, and write them to the database. The only lock contention the IO threads will have is the fine-grained lock on where to write in imetal, rather than on the whole index of the data store. In this way we are able to achieve the above requirements. Here is what imetal looks like.
Now let's look at the detailed design of imetal by understanding its components:
- Log Record: each record that is inserted into imetal
- Segment: a logical partition of the log. Each segment holds a fixed number of Log Records.
- Locator: a consultant that tells an IO thread where to insert a Log Record in imetal, or, precisely speaking, which slot in which segment.
- Flush Queue: full or expired segments go to the flush queue.
- Worker Thread Pool: a pool of worker threads that work on the segments placed in the flush queue.
- Flush Function: the function that is called when a segment, which is nothing but a batch of Log Records, needs to be flushed to the database.
- Flush Data-Structure: before the flush we need to de-duplicate the Log Records so that there is only one entry per GFID. For this we need a data structure like a balanced binary search tree, e.g. a red-black tree (insertion and search in O(log n)), that helps us sort the Log Records. Once the Log Records are sorted in the data structure without duplicates, we traverse the flush data structure and start flushing the Log Records into the database. A speculative sketch of these components follows.
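Since imetal is still only a design, here is a speculative C sketch of the Log Record, Segment and Locator described above; every name in it is hypothetical:

```c
/* Speculative sketch of imetal's core pieces; imetal is unimplemented,
 * so all names and sizes here are hypothetical. */
#include <pthread.h>
#include <stdint.h>

#define RECORDS_PER_SEGMENT 4096    /* "size of segment" knob */

typedef struct log_record {
    uint8_t  gfid[16];              /* file identity                    */
    uint32_t write_delta;           /* consolidated heat, e.g. +10      */
} log_record_t;

typedef struct segment {
    log_record_t records[RECORDS_PER_SEGMENT];
    uint32_t     used;              /* slots handed out so far          */
    struct segment *next;           /* links segments into the ring     */
} segment_t;

typedef struct locator {
    pthread_mutex_t lock;           /* the only contention on the IO path */
    segment_t      *active;         /* segment currently taking records   */
} locator_t;

/* Reserve a slot for an IO thread; returns NULL when the active segment
 * is full, at which point the caller rotates to the next segment and
 * pushes the full one onto the flush queue (not shown). */
static log_record_t *
locator_reserve(locator_t *loc)
{
    log_record_t *slot = NULL;
    pthread_mutex_lock(&loc->lock);
    if (loc->active->used < RECORDS_PER_SEGMENT)
        slot = &loc->active->records[loc->active->used++];
    pthread_mutex_unlock(&loc->lock);
    return slot;
}
```

A worker thread would then sort a queued segment into the flush data structure (one entry per GFID) and write it out in a single transaction, which also solves the batching problem noted earlier.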
Now, imetal is not specific to libgfdb; it can be designed and implemented in a generic way, i.e. consumers of imetal should be given the liberty to define their own Log Record, flush function and flush data structure, with imetal only providing the infrastructure for batch processing. Also, imetal should be implemented in libgfdb, above the different data-store plugins; that way, no matter what the data store is, users can benefit from imetal.
Configurations in imetal could be:
- Segment size: how many Log Records in a segment
- imetal size: how many segments in imetal
- Thread pool size: how many worker threads
- On-demand flush: a mechanism to flush segments even when they are not full.
Potential issues with imetal could be:
- Loss of inserts/updates: if imetal is full, incoming inserts/updates are lost. This may lead to a loss of precision, but with the help of the CTR lookup heal the heat data store is eventually healed.
- Freshness of queries: since we are not inserting/updating the data store directly, we need to flush all segments on demand before a query is fired.
imetal is not yet implemented, and this could be a good mini project for someone from the community to pick up 😉
There are a few other performance/scale issues that SQLite3 has:
- SQLite3 doesn't scale well: since SQLite3 stores data in a single file, it doesn't scale well when we have a huge number of records. The obvious solution would be sharding of the databases, i.e. having multiple sqlite3 database files for a single brick and distributing records across them based on GFID (see the sketch after this list).
- Garbage collection: we have turned auto garbage collection (auto_vacuum) off to improve performance, but this can backfire as the database grows, since there will be a lot of sparse space in the databases. Periodic garbage collection is needed.
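For the sharding idea, a hedged sketch of GFID-based shard selection: since GFIDs are effectively random UUIDs, even a single byte spreads records evenly across per-brick database files. The function name is illustrative:

```c
/* Hypothetical shard selection: pick one of N per-brick sqlite3 files
 * from the first byte of the GFID (a 16-byte UUID). */
#include <stdint.h>

static inline unsigned int
gfid_to_shard(const uint8_t gfid[16], unsigned int num_shards)
{
    /* GFIDs are effectively random, so one byte distributes
     * records evenly across the shards. */
    return gfid[0] % num_shards;
}
```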
Well, to summarize: tiering is a very tricky feature, and it takes time to perfect and fine-tune it. With CTR/libgfdb we have tried to provide a good number of tunables and plugin points, so that new things can be tried and we arrive at an optimal solution.
Signing off for now!🙂