Extensible Storage Engine (ESE), also known as JET Blue, is an ISAM (Indexed Sequential Access Method) data storage technology from Microsoft. ESE is the core of Microsoft Exchange Server, Active Directory, BranchCache, and Windows Search. It is also used by a number of Windows components, including the Windows Update client and Help and Support Center. Its purpose is to allow applications to store and retrieve data via indexed and sequential access.
ESE provides transacted data update and retrieval. A crash recovery mechanism is provided so that data consistency is maintained even in the event of a system crash. Transactions in ESE are highly concurrent, making ESE suitable for server applications. ESE caches data intelligently to ensure high-performance access to data. In addition, ESE is lightweight, making it suitable for auxiliary applications as well.
The ESE runtime (ESENT.DLL) has shipped in every Windows release since Windows 2000, with the first native x64 version of the runtime shipping with the x64 editions of Windows XP and Windows Server 2003. Microsoft Exchange, up to Exchange 2003, shipped with only the 32-bit edition, as that was the only supported platform. With Exchange 2007, it ships with the 64-bit edition.
Database
A database is both a physical and logical grouping of data. An ESE database appears as a single file to Windows. Internally, the database is a collection of 2, 4, 8, 16, or 32 KB pages (the 16 and 32 KB page options are available only in Windows 7 and Exchange 2010), arranged in a balanced B-tree structure. These pages contain meta-data describing the data contained within the database, the data itself, indexes to persist interesting orders of the data, and other information. This information is intermixed within the database file, but efforts are made to keep data that is used together clustered together within the database. An ESE database can contain up to 2^32 pages, or 16 terabytes of data, with 8-kilobyte pages.
ESE databases are organized into groups called instances. Most applications use a single instance, but an application may use multiple instances. The significance of an instance is that it associates a single recovery log series with one or more databases. Currently, up to 6 user databases may be attached to an ESE instance at any time. Each separate process using ESE may have up to 1024 ESE instances.
A database is portable in that it can be detached from one running ESE instance and later attached to the same or a different running instance. While detached, a database can be copied using standard Windows utilities. A database cannot be copied while it is being actively used, because ESE opens database files exclusively. A database may physically reside on any device supported for directly addressable I/O operations by Windows.
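The sequence of calls involved is easiest to see in code. Below is a minimal sketch using the published ESE C API from esent.h (link with esent.lib); the instance name, database file name, and function name are illustrative assumptions, and error cleanup is omitted for brevity:

```c
#include <windows.h>
#include <esent.h>   /* link with esent.lib */

/* Create an instance, start a session, and attach/open a database. */
JET_ERR OpenMyDatabase(JET_INSTANCE *pinst, JET_SESID *psesid, JET_DBID *pdbid)
{
    JET_ERR err;

    /* An instance ties one recovery-log series to the databases attached to it. */
    err = JetCreateInstance(pinst, "myapp-instance");
    if (err < JET_errSuccess) return err;

    err = JetInit(pinst);                       /* runs crash recovery if needed */
    if (err < JET_errSuccess) return err;

    err = JetBeginSession(*pinst, psesid, NULL, NULL);
    if (err < JET_errSuccess) return err;

    /* Attach the database file to this instance, then open it. */
    err = JetAttachDatabase(*psesid, "myapp.edb", 0);
    if (err < JET_errSuccess) return err;

    return JetOpenDatabase(*psesid, "myapp.edb", NULL, pdbid, 0);
}
```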
Tables
A table is a collection of homogeneous records, where every record has the same set of columns. Each table is identified by a table name, whose scope is local to the database in which the table is contained. The amount of disk space allocated to a table within the database is determined by parameters given when the table is created with the CreateTable operation. Tables grow automatically in response to data creation.
Tables have one or more indexes. There must be at least one clustered index for record data. When no clustered index is defined by the application, an artificial index is used which orders and clusters records by the chronological order of record insertion. Indexes are defined to persist interesting orders of data, and allow both sequential access to records in index order and direct access to records by index column values. Clustered indexes in ESE must also be primary, meaning that the index key must be unique.
Clustered and non-clustered indexes are represented using B+ trees. If an insert or update operation causes a page to overflow, the page is split: a new page is allocated and is logically chained in between the two previously adjacent pages. Since this new page is not physically adjacent to its logical neighbors, access to it is not as efficient. ESE has an on-line compaction feature that defragments the data. If a table is expected to be updated frequently, space may be reserved for future insertions by specifying an appropriate page density when the table or index is created. This allows split operations to be avoided or postponed.
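As a hedged illustration, a table reserving extra space per page might be created as follows; the table name, initial page count, and 80% density are assumptions chosen to show the density parameter discussed above:

```c
#include <windows.h>
#include <esent.h>

/* Create a table, reserving initial pages and a page density that leaves
   room in each page for future record growth. */
JET_ERR CreatePeopleTable(JET_SESID sesid, JET_DBID dbid, JET_TABLEID *ptableid)
{
    /* 16 initial pages; 80% density leaves ~20% free per page, which
       delays page splits for update-heavy tables. */
    return JetCreateTable(sesid, dbid, "people", 16, 80, ptableid);
}
```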
Records and columns
A record is a collection of related column values. Records are inserted and updated via Update operations and can be deleted via Delete operations. Columns are set and retrieved via the SetColumns and RetrieveColumns operations, respectively. The maximum size of a record is 8110 bytes for 8-kilobyte pages, with the exception of long value columns. The LongText and LongBinary column types do not contribute significantly toward this size restriction, and records can hold data much larger than a database page when the data is stored in long value columns. When a long value reference is stored in a record, only 9 bytes of in-record data are required. These long values may themselves be up to 2 gigabytes (GB) in size.
Records are typically uniform in that each record has a set of values for the same set of columns. In ESE, it is also possible to define many columns for a table and yet have any given record contain only a small number of non-NULL column values. In this sense, a table can also be a collection of heterogeneous records.
ESE supports a wide range of column values, ranging in size from 1 bit to 2 GB. Choosing the correct column type is important because the type of a column determines many of its properties, including its ordering in indexes. The following data types are supported by ESE:
- Bit: ternary value (NULL, 0, or 1)
- Unsigned byte: 1-byte unsigned integer
- Short: 2-byte signed integer
- Unsigned short: 2-byte unsigned integer
- Long: 4-byte signed integer
- Unsigned long: 4-byte unsigned integer
- LongLong: 8-byte signed integer
- Currency: 8-byte signed integer
- IEEE single: 4-byte floating-point number
- IEEE double: 8-byte floating-point number
- DateTime: 8-byte date and time
- GUID: 16-byte unique identifier
- Binary: binary string of up to 255 bytes
- Text: ANSI or Unicode string of up to 255 bytes
- Long Binary: large binary string of up to 2 GB
- Long Text: large ANSI or Unicode string of up to 2 GB
Fixed, variable, and tagged columns
Each ESE table can define up to 127 fixed-length columns, 128 variable-length columns, and 64,993 tagged columns.
- A fixed column is essentially a column that takes the same amount of space in each record, regardless of its value. A fixed column takes 1 bit to represent the NULLity of its value, plus a fixed amount of space in every record in which that column, or a later-defined fixed column, is set.
- A variable column is essentially a column that takes a variable amount of space in each record in which it is set, depending on the size of the particular column value. Variable columns take 2 bytes to represent NULLity and size, plus a variable amount of space in each record in which the column is set.
- Tagged columns are columns that take no space whatsoever when not set in a record. They may be single-valued, but can also be multi-valued: the same tagged column may have multiple values in a single record. When a tagged column is set in a record, each instance of the tagged column takes approximately 4 bytes of space in addition to the size of the column value. When the number of instances of a single tagged column is large, the overhead per instance is approximately 2 bytes. Tagged columns are ideal for sparse columns because they take no space whatsoever when not set. If a multi-valued tagged column is indexed, the index will contain one entry for the record for each value of the tagged column.
For a given table, columns fall into one of two categories: those which occur exactly once in each record, possibly with a few NULL values; and those which occur rarely, or which may have multiple occurrences in a single record. Fixed and variable columns belong to the former category, while tagged columns belong to the latter. The internal representation of the two column categories differs, and it is important to understand the trade-offs between them. Fixed and variable columns are typically represented in every record, even when the occurrence has a NULL value; these columns can be addressed quickly via an offset table. Occurrences of tagged columns are preceded by their column identifier, and the column is located by binary search of the set of tagged column occurrences.
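A sketch of how the three column categories might be declared with JetAddColumn follows; the column names, and their pairing with the hypothetical "people" table above, are assumptions:

```c
#include <string.h>
#include <windows.h>
#include <esent.h>

/* Add one fixed, one variable, and one tagged column to an open table. */
JET_ERR AddExampleColumns(JET_SESID sesid, JET_TABLEID tableid)
{
    JET_COLUMNDEF coldef;
    JET_COLUMNID colid;
    JET_ERR err;

    memset(&coldef, 0, sizeof(coldef));
    coldef.cbStruct = sizeof(coldef);

    /* Fixed-length column: a 4-byte signed integer. */
    coldef.coltyp = JET_coltypLong;
    coldef.grbit  = JET_bitColumnFixed;
    err = JetAddColumn(sesid, tableid, "id", &coldef, NULL, 0, &colid);
    if (err < JET_errSuccess) return err;

    /* Variable-length column: short text, up to 255 bytes. */
    coldef.coltyp = JET_coltypText;
    coldef.cbMax  = 255;
    coldef.grbit  = 0;
    err = JetAddColumn(sesid, tableid, "name", &coldef, NULL, 0, &colid);
    if (err < JET_errSuccess) return err;

    /* Tagged, multi-valued column: costs no space in records where unset. */
    coldef.coltyp = JET_coltypText;
    coldef.grbit  = JET_bitColumnTagged | JET_bitColumnMultiValued;
    return JetAddColumn(sesid, tableid, "phone", &coldef, NULL, 0, &colid);
}
```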
Long values
The Long Text and Long Binary column types are large binary objects. They are stored in a B+ tree separate from the clustered index, keyed by long value id and byte offset. ESE supports append, byte-range overwrite, and set-size operations on these columns. In addition, ESE has a single-instance store feature in which multiple records may reference the same large binary object, as though each record had its own copy of the information, i.e. without inter-record locking conflicts. The maximum size of a Long Text or Long Binary column value is 2 GB.
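A minimal sketch of a long-value append using the JET_bitSetAppendLV flag follows; the helper name and column are assumptions, and the call is assumed to occur inside a transaction:

```c
#include <windows.h>
#include <esent.h>

/* Append a chunk of data to a LongBinary column of the current record.
   Long values support append, byte-range overwrite, and set-size. */
JET_ERR AppendToBlob(JET_SESID sesid, JET_TABLEID tableid,
                     JET_COLUMNID colidBlob, const void *pv, unsigned long cb)
{
    JET_SETINFO setinfo = { sizeof(JET_SETINFO), 0, 1 }; /* first instance */
    JET_ERR err;

    err = JetPrepareUpdate(sesid, tableid, JET_prepReplace);
    if (err < JET_errSuccess) return err;

    /* JET_bitSetAppendLV appends to the existing long value rather than
       overwriting it. */
    err = JetSetColumn(sesid, tableid, colidBlob, pv, cb,
                       JET_bitSetAppendLV, &setinfo);
    if (err < JET_errSuccess) {
        JetPrepareUpdate(sesid, tableid, JET_prepCancel);
        return err;
    }

    return JetUpdate(sesid, tableid, NULL, 0, NULL);
}
```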
Version, auto-increment, and escrow columns
A version column is automatically incremented by ESE each time the record containing the column is modified via an Update operation. This column cannot be set by the application; it can only be read. Applications use version columns to determine whether an in-memory copy of a given record needs to be refreshed: if the value in the table record is greater than the value in a cached copy, the cached copy is known to be out of date. Version columns must be of type Long.
An auto-increment column is automatically set by ESE such that the value contained in the column is unique for every record in the table. These columns, like version columns, cannot be set by the application. Auto-increment columns are read-only, and are set automatically when a new record is inserted into the table via an Update operation. The value in the column remains constant for the lifetime of the record, and only one auto-increment column is allowed per table. Auto-increment columns may be of type Long or type Currency.
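For illustration, both column kinds might be declared as follows; the column names are hypothetical:

```c
#include <string.h>
#include <windows.h>
#include <esent.h>

/* Add a version column and an auto-increment column. Both are maintained
   by ESE itself and are read-only to the application. */
JET_ERR AddBookkeepingColumns(JET_SESID sesid, JET_TABLEID tableid)
{
    JET_COLUMNDEF coldef;
    JET_COLUMNID colid;
    JET_ERR err;

    memset(&coldef, 0, sizeof(coldef));
    coldef.cbStruct = sizeof(coldef);
    coldef.coltyp   = JET_coltypLong;

    coldef.grbit = JET_bitColumnVersion;       /* bumped on every Update */
    err = JetAddColumn(sesid, tableid, "version", &coldef, NULL, 0, &colid);
    if (err < JET_errSuccess) return err;

    coldef.grbit = JET_bitColumnAutoincrement; /* unique value per record */
    return JetAddColumn(sesid, tableid, "sequence", &coldef, NULL, 0, &colid);
}
```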
An escrow column can be modified via the EscrowUpdate operation. Escrowed updates are numeric delta operations, and escrow columns must be of type Long. Examples of numeric delta operations include adding 2 to a value, or subtracting 1 from a value. ESE tracks the change in a value rather than its final value. Multiple sessions may each have outstanding changes made via EscrowUpdate to the same value, because ESE can determine the actual final value regardless of which transactions commit and which roll back. This allows many users to update a column concurrently by making numeric delta changes. Optionally, the database engine can discard records whose column value reaches zero. A common use of an escrow column is a reference counter: many threads increment and decrement the value without locks, and when the counter reaches zero the record is automatically deleted.
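A sketch of such a counter update via JetEscrowUpdate follows; the helper name is hypothetical, the column is assumed to have been created with JET_bitColumnEscrowUpdate, and the call is assumed to occur inside a transaction:

```c
#include <windows.h>
#include <esent.h>

/* Atomically add a delta to an escrow column (JET_coltypLong, created with
   JET_bitColumnEscrowUpdate). Concurrent escrow updates never conflict. */
JET_ERR IncrementCounter(JET_SESID sesid, JET_TABLEID tableid,
                         JET_COLUMNID colidCounter, long delta)
{
    long valueOld = 0;
    unsigned long cbOld = 0;

    /* ESE records the delta, not the final value, so concurrent sessions
       can update the same counter without locking each other out. */
    return JetEscrowUpdate(sesid, tableid, colidCounter,
                           &delta, sizeof(delta),
                           &valueOld, sizeof(valueOld), &cbOld, 0);
}
```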
Indexes
An index is a persisted ordering of records in a table. Indexes are used both for sequential access to rows in the order defined, and for direct record navigation based on indexed column values. The order defined by an index is described in terms of an array of columns, in precedence order. This array of columns is also called the index key. Each column is called an index segment. Each index segment may be either ascending or descending in terms of its contribution to the ordering. Any number of indexes may be defined for a table. ESE provides a rich set of indexing features.
Clustered indexes
One index may be defined as the clustered, or primary, index. In ESE, the clustered index must be unique and is referred to as the primary index. Other indexes are described as non-clustered, or secondary, indexes. Primary indexes differ from secondary indexes in that the index entry is the record itself, not a logical pointer to the record. Secondary indexes have primary keys at their leaves to logically link to the record in the primary index. In other words, the table is physically clustered in primary index order. Retrieval of non-indexed record data in primary index order is generally much faster than in secondary index order, because a single disk access can bring into memory multiple records that will be accessed close together in time; the same disk access satisfies multiple record access operations. However, inserting a record into the middle of an index, as determined by the primary index order, may be much slower than appending it to the end of the index. Update frequency must be carefully weighed against retrieval patterns when designing tables. If no primary index is defined for a table, then an implicit primary index, called the database key (DBK) index, is created. The DBK is simply a unique, ascending number incremented each time a record is inserted. As a result, the physical order of records in a DBK index is the chronological insertion order, and new records are always added at the end of the table. If an application wishes to cluster data on a non-unique index, this is possible by adding an auto-increment column to the end of the non-unique index definition.
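A sketch of declaring a unique primary index plus a secondary index follows, reusing the hypothetical columns from the earlier examples; key strings are double-null-terminated lists of '+' (ascending) or '-' (descending) prefixed column names:

```c
#include <windows.h>
#include <esent.h>

/* Create a unique primary (clustered) index on "id" and a secondary
   index on "name". */
JET_ERR CreateExampleIndexes(JET_SESID sesid, JET_TABLEID tableid)
{
    /* String literals get an implicit trailing NUL, so "+id\0" is the
       required double-null-terminated "+id\0\0" in memory. */
    static const char keyPrimary[]   = "+id\0";
    static const char keySecondary[] = "+name\0";
    JET_ERR err;

    err = JetCreateIndex(sesid, tableid, "primary",
                         JET_bitIndexPrimary | JET_bitIndexUnique,
                         keyPrimary, sizeof(keyPrimary), 80);
    if (err < JET_errSuccess) return err;

    return JetCreateIndex(sesid, tableid, "byName", 0,
                          keySecondary, sizeof(keySecondary), 80);
}
```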
Indexing over multi-valued columns
Indexes can be defined over multi-valued columns. Multiple entries may exist in such an index for records with multiple values for the indexed column. Multi-valued columns may be indexed together with single-valued columns. When two or more multi-valued columns are indexed together, the multi-valued property is honored only for the first multi-valued column in the index; columns of lower precedence are treated as though they were single-valued.
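As a sketch, a new value of a multi-valued tagged column might be added as follows during a pending update; the helper and column names are assumptions, and the key point is the itagSequence semantics (0 appends a new instance, a non-zero value replaces that specific instance):

```c
#include <windows.h>
#include <esent.h>

/* Add a new instance (value) to a multi-valued tagged column of the
   record currently being updated (between PrepareUpdate and Update). */
JET_ERR AddPhoneNumber(JET_SESID sesid, JET_TABLEID tableid,
                       JET_COLUMNID colidPhone,
                       const char *szPhone, unsigned long cbPhone)
{
    /* itagSequence = 0 requests that the value be added as a new instance. */
    JET_SETINFO setinfo = { sizeof(JET_SETINFO), 0, 0 };
    return JetSetColumn(sesid, tableid, colidPhone,
                        szPhone, cbPhone, 0, &setinfo);
}
```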
Sparse indexes
Indexes can also be defined to be sparse. A sparse index does not necessarily have at least one entry for each record in the table. A number of options exist when defining a sparse index: options exist to exclude records from the index when the entire index key is NULL, when any key segment is NULL, or when only the first key segment is NULL. Indexes can also have conditional columns. A conditional column never appears within the index, but can cause a record not to be indexed when the conditional column is either NULL or non-NULL.
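A hedged sketch of a sparse index with a conditional column, using JetCreateIndex2, follows; the index name, key, and "deleted" conditional column are assumptions:

```c
#include <string.h>
#include <windows.h>
#include <esent.h>

/* Create a sparse index: records with any NULL key segment are excluded
   (JET_bitIndexIgnoreAnyNull), and records are indexed only while the
   conditional column "deleted" is NULL. */
JET_ERR CreateSparseIndex(JET_SESID sesid, JET_TABLEID tableid)
{
    static const char key[] = "+name\0";    /* double-null-terminated */
    JET_CONDITIONALCOLUMN condcol;
    JET_INDEXCREATE idx;

    memset(&condcol, 0, sizeof(condcol));
    condcol.cbStruct     = sizeof(condcol);
    condcol.szColumnName = "deleted";
    condcol.grbit        = JET_bitIndexColumnMustBeNull;

    memset(&idx, 0, sizeof(idx));
    idx.cbStruct            = sizeof(idx);
    idx.szIndexName         = "byNameLive";
    idx.szKey               = (char *)key;
    idx.cbKey               = sizeof(key);
    idx.grbit               = JET_bitIndexIgnoreAnyNull;
    idx.ulDensity           = 80;
    idx.rgconditionalcolumn = &condcol;
    idx.cConditionalColumn  = 1;

    return JetCreateIndex2(sesid, tableid, &idx, 1);
}
```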
Tuple indexes
An index can also be defined to include one entry for each sub-string of a Text or Long Text column. These indexes are called tuple indexes. They are used to speed up queries with sub-string matching predicates. Tuple indexes can only be defined over Text columns. For example, if a Text column value is "I like JET Blue", and the index is configured with a minimum tuple size of 4 characters and a maximum tuple length of 10 characters, then the following sub-strings will be indexed:
"I like JET"
" like JET "
"like JET B"
"ike JET Bl"
"ke JET Blu"
"e JET Blue"
" JET Blue"
"JET Blue"
"ET Blue"
"T Blue"
" Blue"
"Blue"
Even though tuple indexes can be very large, they can significantly speed up queries of the form: find all records containing "JET Blue". They can be used for sub-strings longer than the maximum tuple length by dividing the search sub-string into maximum-tuple-length search strings and intersecting the results. They can be used for exact matches on strings as long as the maximum tuple length or as short as the minimum tuple length, with no index intersection at all. For more information on performing index intersection in ESE, see Index intersection below. Tuple indexes cannot speed up queries where the search string is shorter than the minimum tuple length.
Transactions
A transaction is a logical unit of processing delimited by BeginTransaction and CommitTransaction, or Rollback, operations. All updates made during a transaction are atomic: they either all appear in the database at the same time, or none appear. Any subsequent updates by other transactions are invisible to a transaction. However, a transaction can only update data that has not changed in the meantime; otherwise the operation fails at once, without waiting. Read-only transactions never need to wait, and update transactions can interfere only with other update transactions. Transactions terminated by Rollback, or by a system crash, leave no trace in the database: the state of the data is restored on Rollback to what it was before BeginTransaction.
Transactions may be nested up to 7 levels, with one additional level reserved for ESE's internal use. This means that part of a transaction can be rolled back without having to roll back the entire transaction; a CommitTransaction of a nested transaction merely signifies the success of one phase of processing, and the outer transaction may yet fail. Changes are committed to the database only when the outermost transaction is committed; this is known as committing to transaction level 0. When a transaction commits to transaction level 0, data describing the transaction is synchronously flushed to the log to ensure that the transaction will be completed even in the event of a subsequent system crash. This synchronous log flush makes ESE transactions durable. However, in some cases an application wants to order its updates but does not want to pay the price of immediately guaranteeing the changes; here, the application can commit the change with JET_bitCommitLazyFlush.
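The transaction bracketing looks roughly as follows; the function name and the work elided in the middle are placeholders:

```c
#include <windows.h>
#include <esent.h>

/* Run a unit of work as an atomic transaction. A commit at level 0
   synchronously flushes the log, making the transaction durable;
   JET_bitCommitLazyFlush trades that durability guarantee for speed. */
JET_ERR DoUnitOfWork(JET_SESID sesid)
{
    JET_ERR err;

    err = JetBeginTransaction(sesid);
    if (err < JET_errSuccess) return err;

    /* ... perform updates with JetPrepareUpdate / JetSetColumn / JetUpdate ... */

    err = JetCommitTransaction(sesid, 0);   /* or JET_bitCommitLazyFlush */
    if (err < JET_errSuccess)
        JetRollback(sesid, 0);              /* undo everything on failure */
    return err;
}
```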
ESE supports a concurrency control mechanism known as multi-versioning. In multi-versioning, every transaction sees a consistent view of the entire database as it was at the time the transaction began; the only updates it sees are those made by itself. In this way, each transaction operates as though it were the only active transaction running on the system, except in the case of write conflicts. Since a transaction may make changes based on data it read that has since been updated by another transaction, multi-versioning by itself does not guarantee serializable transactions. However, serializability can be achieved where desired simply by using explicit record read locks to lock the read data that an update is based on. Both read and write locks may be explicitly requested with the GetLock operation.
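A sketch of this read-lock pattern follows; it assumes the cursor is already positioned on the record, that the caller has an open transaction, and that the counter column and helper name are hypothetical:

```c
#include <windows.h>
#include <esent.h>

/* Read a value under an explicit read lock, then update the record based
   on the value read. The lock prevents another transaction from changing
   the record in between, giving serializable behavior on top of
   snapshot isolation. Must be called inside a transaction; the lock is
   released at commit or rollback. */
JET_ERR ReadThenUpdate(JET_SESID sesid, JET_TABLEID tableid, JET_COLUMNID colid)
{
    long value = 0;
    unsigned long cb = 0;
    JET_ERR err;

    err = JetGetLock(sesid, tableid, JET_bitReadLock);
    if (err < JET_errSuccess) return err;

    err = JetRetrieveColumn(sesid, tableid, colid,
                            &value, sizeof(value), &cb, 0, NULL);
    if (err < JET_errSuccess) return err;

    err = JetPrepareUpdate(sesid, tableid, JET_prepReplace);
    if (err < JET_errSuccess) return err;

    value = value + 1;   /* decision based on the data just read */
    err = JetSetColumn(sesid, tableid, colid, &value, sizeof(value), 0, NULL);
    if (err < JET_errSuccess) {
        JetPrepareUpdate(sesid, tableid, JET_prepCancel);
        return err;
    }

    return JetUpdate(sesid, tableid, NULL, 0, NULL);
}
```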
In addition, ESE supports an advanced concurrency control feature known as escrow locking. Escrow locking is an extremely concurrent update in which a numeric value is changed in a relative fashion, i.e. by adding or subtracting another numeric value. Escrow updates are non-conflicting even with other concurrent escrow updates to the same datum. This is possible because the supported operations are commutative and can be independently committed or rolled back. As a result, they do not interfere with concurrent update transactions. This feature is often used for maintained aggregations.
ESE also extends transaction semantics from data manipulation operations to data definition operations. It is possible to add an index to a table while concurrently running transactions update the same table, without any transaction lock contention whatsoever. Later, when these transactions have completed, the newly created index is available to all transactions, and it has entries for records updated by other transactions that could not perceive the existence of the index when the updates took place. Data definition operations can be performed with all the features expected of the transaction mechanism for record updates. Data definition operations supported in this fashion include AddColumn, DeleteColumn, CreateIndex, DeleteIndex, CreateTable, and DeleteTable.
Cursor navigation and the copy buffer
A cursor is a logical pointer within a table index. The cursor may be positioned on a record, before the first record, after the last record, or even between records. If a cursor is positioned before or after a record, there is no current record. It is possible to have multiple cursors into the same table index. Many record and column operations are based on the cursor position. Cursor position can be moved sequentially via Move operations, or directly using index keys via Seek operations. Cursors can also be moved to a fractional position within an index; in this way, the cursor can be quickly moved to a thumb-bar position. This operation is performed in the same time as a Seek operation, and no intervening data needs to be accessed.
Each cursor has a copy buffer in order to create a new record, or to modify an existing record, column by column. This is an internal buffer whose contents can be changed with SetColumns operations. Modifications of the copy buffer do not automatically change the stored data. The contents of the current record can be copied into the copy buffer using the PrepareUpdate operation, and the Update operation saves the contents of the copy buffer as a record. The copy buffer is implicitly cleared on a transaction commit or rollback, as well as on navigation operations. RetrieveColumns may be used to retrieve column data either from the record or from the copy buffer, if one exists.
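A sketch of direct cursor positioning and retrieval follows, assuming the cursor's current index is keyed on the hypothetical "id" column from the earlier examples:

```c
#include <windows.h>
#include <esent.h>

/* Position a cursor directly on a record by key, then read a column. */
JET_ERR FindPersonById(JET_SESID sesid, JET_TABLEID tableid,
                       JET_COLUMNID colidName, long id,
                       char *szName, unsigned long cbName)
{
    unsigned long cbActual = 0;
    JET_ERR err;

    /* Build a search key on the current index, then seek to it. */
    err = JetMakeKey(sesid, tableid, &id, sizeof(id), JET_bitNewKey);
    if (err < JET_errSuccess) return err;

    err = JetSeek(sesid, tableid, JET_bitSeekEQ);
    if (err < JET_errSuccess) return err;   /* e.g. JET_errRecordNotFound */

    /* Retrieve from the record the cursor is now positioned on;
       sequential access would instead use JetMove(..., JET_MoveNext, 0). */
    return JetRetrieveColumn(sesid, tableid, colidName,
                             szName, cbName, &cbActual, 0, NULL);
}
```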
Query processing
ESE applications invariably query their data. This section describes features and techniques for writing query processing logic on top of ESE.
Sorts and temporary tables
ESE provides a sort capability in the form of temporary tables. The application inserts data records into the sort process one record at a time, and then retrieves them one record at a time in sorted order. The actual sorting is performed between the insertion of the last record and the retrieval of the first record. Temporary tables can be used for both partial and complete result sets. These tables can offer the same features as base tables, including the ability to navigate sequentially or directly to rows using index keys matching the table definition. Temporary tables can also be updatable, for computing complex aggregates. Simple aggregates can be computed automatically with a feature similar to sorting, where the desired aggregate is a natural result of the sort process.
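A sketch of a sort through a temporary table follows; the single JET_coltypLong key column and the helper name are assumptions, and cleanup on error paths is omitted for brevity:

```c
#include <string.h>
#include <windows.h>
#include <esent.h>

/* Sort values through a temporary table: insert unsorted, read back sorted. */
JET_ERR SortValues(JET_SESID sesid, const long *rgValues, unsigned long cValues)
{
    JET_COLUMNDEF coldef;
    JET_COLUMNID colid;
    JET_TABLEID tableidSort;
    unsigned long i;
    JET_ERR err;

    memset(&coldef, 0, sizeof(coldef));
    coldef.cbStruct = sizeof(coldef);
    coldef.coltyp   = JET_coltypLong;
    coldef.grbit    = JET_bitColumnTTKey;   /* this column is the sort key */

    err = JetOpenTempTable(sesid, &coldef, 1, 0, &tableidSort, &colid);
    if (err < JET_errSuccess) return err;

    for (i = 0; i < cValues; i++) {
        err = JetPrepareUpdate(sesid, tableidSort, JET_prepInsert);
        if (err < JET_errSuccess) return err;
        err = JetSetColumn(sesid, tableidSort, colid,
                           &rgValues[i], sizeof(rgValues[i]), 0, NULL);
        if (err < JET_errSuccess) return err;
        err = JetUpdate(sesid, tableidSort, NULL, 0, NULL);
        if (err < JET_errSuccess) return err;
    }

    /* The sort materializes here: walk the records in key order. */
    for (err = JetMove(sesid, tableidSort, JET_MoveFirst, 0);
         err >= JET_errSuccess;
         err = JetMove(sesid, tableidSort, JET_MoveNext, 0)) {
        /* ... retrieve each sorted value with JetRetrieveColumn ... */
    }

    return JetCloseTable(sesid, tableidSort);
}
```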
Covering indexes
Retrieving column data directly from secondary indexes is an important performance optimization. Columns can be retrieved directly from secondary indexes, without accessing the data record, via the JET_bitRetrieveFromIndex flag on the RetrieveColumns operation. It is much more efficient to retrieve columns from a secondary index than from the record when navigating by that index: if the column data were instead retrieved from the record, additional navigation would be required to locate the record by its primary key, possibly causing additional disk accesses. When an index provides all the columns a query needs, it is called a covering index. Note that columns defined in the table's primary index are also found in secondary indexes, and can similarly be retrieved using JET_bitRetrieveFromPrimaryBookmark.
Index keys are stored in a normalized form which can, in many cases, be denormalized back to the original column value. Normalization is not always reversible; for example, Text and Long Text columns cannot be denormalized. In addition, index keys may be truncated when the column data is very long. In cases where a column cannot be retrieved directly from a secondary index, the record is always accessed instead to retrieve the required data.
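A minimal sketch of retrieving from the index entry itself follows; it assumes the cursor's current index is a secondary index containing the requested column, and the helper name is hypothetical:

```c
#include <windows.h>
#include <esent.h>

/* Retrieve a column straight from the secondary index entry, avoiding a
   second B-tree navigation to the record in the primary index. */
JET_ERR GetNameFromIndex(JET_SESID sesid, JET_TABLEID tableid,
                         JET_COLUMNID colidName,
                         char *szName, unsigned long cbName)
{
    unsigned long cbActual = 0;

    /* Requires the current index to cover "name"; columns in the primary
       key can instead use JET_bitRetrieveFromPrimaryBookmark. */
    return JetRetrieveColumn(sesid, tableid, colidName, szName, cbName,
                             &cbActual, JET_bitRetrieveFromIndex, NULL);
}
```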
Index intersection
Queries frequently involve a combination of restrictions on the data. An efficient means of processing a restriction is to use an available index. However, if a query involves multiple restrictions, applications often process the restrictions by walking the complete index range of the most restrictive predicate satisfied by a single index. Each remaining predicate, called a residual predicate, is processed by applying it to the record itself. This is a simple method, but a potentially costly one, since many disk accesses may be needed to bring records into memory just to apply the residual predicates.
Index intersection is an important query mechanism in which multiple indexes are used together to more efficiently process a complex restriction. Instead of using only a single index, index ranges on multiple indexes are combined to produce a much smaller set of records against which any residual predicates can be applied. ESE makes this easy by supplying an IntersectIndexes operation. This operation accepts a series of index ranges on indexes of the same table, and returns a temporary table of primary keys that can be used to navigate to the base table records satisfying all index predicates.
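A hedged sketch of the IntersectIndexes call follows; it assumes the caller has already opened two cursors on the same table and established an index range on each (via JetMakeKey, JetSeek, and JetSetIndexRange):

```c
#include <string.h>
#include <windows.h>
#include <esent.h>

/* Intersect the index ranges of two cursors over the same table. */
JET_ERR IntersectTwoRanges(JET_SESID sesid,
                           JET_TABLEID tableidRange1, JET_TABLEID tableidRange2,
                           JET_RECORDLIST *precordlist)
{
    JET_INDEXRANGE rgRange[2];

    memset(rgRange, 0, sizeof(rgRange));
    rgRange[0].cbStruct = sizeof(rgRange[0]);
    rgRange[0].tableid  = tableidRange1;
    rgRange[0].grbit    = JET_bitRecordInIndex;
    rgRange[1].cbStruct = sizeof(rgRange[1]);
    rgRange[1].tableid  = tableidRange2;
    rgRange[1].grbit    = JET_bitRecordInIndex;

    memset(precordlist, 0, sizeof(*precordlist));
    precordlist->cbStruct = sizeof(*precordlist);

    /* Returns a temporary table (precordlist->tableid) of primary-key
       bookmarks for records inside every supplied index range; each
       bookmark can be passed to JetGotoBookmark on the base table. */
    return JetIntersectIndexes(sesid, rgRange, 2, precordlist, 0);
}
```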
Pre-joined tables
A join is a common operation on a normalized table design, in which logically related data is brought back together for use in an application. Joins can be expensive operations, because many data accesses may be needed to bring the related data into memory. This effort can be optimized in some cases by defining a single base table that contains the data of two or more logical tables. The column set of the base table is the union of the column sets of these logical tables. Tagged columns make this possible because of their good handling of both multi-valued and sparse data. Since related data is stored together in the same record, it is accessed together, minimizing the number of disk accesses needed to perform the join. This process can scale to a large number of logical tables, since ESE can support up to 64,993 tagged columns. Since indexes can be defined over multi-valued columns, it is still possible to index the 'interior' tables. However, some limitations exist, and applications should consider pre-joining carefully before employing this technique.
Logging and recovery
ESE's logging and recovery feature supports guaranteed data integrity and consistency in the event of a system crash. Logging is the process of redundantly recording database update operations in a log file. The log file structure is very robust against system crashes. Recovery is the process of using this log to restore databases to a consistent state after a system crash.
Transaction operations are logged, and the log is flushed to disk during each commit to transaction level 0. This allows the recovery process to redo updates made by transactions that committed to transaction level 0, and to undo changes made by transactions that did not. This type of recovery scheme is often referred to as a 'roll-forward/roll-backward' recovery scheme. Logs can be retained until the data is safely copied via a backup process described below, or they can be reused in a circular fashion as soon as they are no longer needed for recovery from a system crash. Circular logging minimizes the amount of disk space needed for the log, but it has implications for the ability to recreate a data state in the event of a media failure.
Backup and restore
Logging and recovery also play a role in protecting data from media failure. ESE supports on-line backup, in which one or more databases are copied, along with the log files, in a manner that does not affect database operations: the databases can continue to be queried and updated while the backup is being made. Such a backup is referred to as a 'fuzzy backup' because the recovery process must be run as part of backup restoration to restore a consistent set of databases. Both streaming and shadow copy backups are supported.
A streaming backup is a backup method in which copies of all desired database files and the necessary log files are made during the backup process. File copies may be saved directly to tape, or may be made to any other storage device. No quiescing of activity of any kind is required with streaming backups. Both the database and the log files are checksum-verified to ensure that no data corruption exists within the data set during the backup process. Streaming backups may also be incremental backups, in which only the log files are copied; an incremental backup can be restored together with a previous full backup to bring all databases to a recent state.
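As a sketch, an online streaming backup of a running instance might be taken as follows; the backup path and helper name are assumptions:

```c
#include <windows.h>
#include <esent.h>

/* Take a streaming backup of all databases attached to an instance while
   they remain online. JET_bitBackupIncremental copies only the log files
   generated since the last full backup. */
JET_ERR BackupNow(JET_INSTANCE inst, int fIncremental)
{
    return JetBackupInstance(inst, "C:\\backups\\myapp",
                             fIncremental ? JET_bitBackupIncremental : 0,
                             NULL /* no progress callback */);
}
```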
Shadow copy backups are a newer, high-speed backup method. A shadow copy backup is dramatically faster because the copy is made virtually after a brief period of quiescence of the application. As subsequent updates are made to the data, the virtual copy is materialized. In some cases, hardware support for shadow copies means that the virtual copies never actually need to be stored. Shadow copy backups are always full backups.
A restore can be used to apply a single full backup, or to apply a combination of a single full backup with one or more incremental backups. Furthermore, any existing log files can be replayed as well, to recreate the entire data set all the way up to the last transaction logged as committed to transaction level 0. Restoration of a backup can be made to any system capable of supporting the original application; it need not be the same machine, or even the same machine configuration. The location of files can be changed as part of the restoration process.
Backup and restore to different hardware
When an ESENT database is created, the physical disk sector size is stored with the database. The physical sector size is expected to remain consistent between sessions; otherwise, an error is reported. When a physical drive is cloned or restored from a drive image onto a drive with a different physical sector size (such as an Advanced Format drive), ESENT will report errors.
This is a known issue, and Microsoft has hotfixes available. For Windows Vista or Windows Server 2008, see KB2470478. For Windows 7 or Windows Server 2008 R2, see KB982018.
History
JET Blue was originally developed by Microsoft as a prospective upgrade for the JET Red database engine in Microsoft Access, but it was never used in that role. Instead, it went on to be used by Exchange Server, Active Directory, File Replication Service (FRS), Security Configuration Editor, Certificate Services, Windows Internet Name Service (WINS), and a number of other Microsoft services, applications, and Windows components. For years it was a private API used only by Microsoft, but it has since become a published API that anyone can use.
Work began on Data Access Engine (DAE) in March 1989 when Allen Reiter joined Microsoft. Over the next year, a team of four developers worked for Allen to largely complete the ISAM. Microsoft already had the BC7 ISAM (JET Red), but began the Data Access Engine (DAE) effort to build a more robust database engine as an entry into the then-new client-server architecture world. In the spring of 1990, the BC7 ISAM and DAE teams were merged to become the Joint Engine Technology (JET) effort, responsible for producing two engines, a v1 (JET Red) and a v2 (JET Blue), that would conform to the same API specification (JET API). DAE became JET Blue for the color of the Israeli flag; BC7 ISAM became JET Red for the color of the Russian flag. While JET Blue and JET Red were written to the same API specification, they shared no ISAM code whatsoever. They both supported a common query processor, QJET, which later, together with the BC7 ISAM, became synonymous with JET Red.
JET Blue first shipped in 1994 as the ISAM for the WINS, DHCP, and now-defunct RPL services in Windows NT 3.5. It shipped again as the storage engine for Microsoft Exchange in 1996. Additional Windows services chose JET Blue as their storage technology, and by 2000 every version of Windows began to ship with JET Blue. JET Blue is used by Active Directory and became part of a special set of Windows code called the Trusted Computing Base (TCB). The number of Microsoft applications using JET Blue continued to grow, and the JET Blue API was published in 2005 to facilitate its use by the ever-increasing number of applications and services both inside and outside Windows.
A Microsoft Exchange Web Blog entry states that developers who have contributed to JET Blue include Cheen Liao, Stephen Hecht, Matthew Bellew, Ian Jose, Edward "Eddie" Gilbert, Kenneth Kin Lum, Balasubramanian Sriram, Jonathan Liem, Andrew Goodsell, Laurion Burchall, Andrei Marinescu, Adam Foxman, Ivan Trindev, Spencer Low, and Brett Shirley.
Comparison to JET Red
While they share a common lineage, there are major differences between JET Red and ESE.
- JET Red is a file sharing technology, while ESE is designed to be embedded in server applications, and does not share files.
- JET Red makes a best effort at file recovery, while ESE has write-ahead logging and snapshot isolation for guaranteed crash recovery.
- JET Red before version 4.0 only supports page-level locking, while ESE and JET Red version 4.0 support record level locking.
- JET Red supports a variety of query interfaces, including ODBC and OLE DB. ESE ships with no query engine, instead relying on applications to write their own queries as ISAM-level C code.
- JET Red has a maximum file size of 2 GiB, whereas ESE has a maximum database size of 8 TiB with 4 KiB pages, and 16 TiB with 8 KiB pages.
External links
- ManagedEsent - managed .NET interfaces to ESENT
- ESENT Serialization - an object persistence framework for .NET, based on ManagedEsent.
- [1] - Library and tools for accessing the Extensible Storage Engine (ESE) Database (EDB) format.
- RavenDB - NoSQL Document Database built on ESENT.
- ESEDatabaseView - Utility for viewing ESE databases