A chess game is exported in the form of a PGN file. This blog post explains the algorithm that helps in compressing a PGN file so that you can store it in lesser space. As explained in the first part of this blog post series “How to write a cost-effective database for chess from scratch”, we have to consider space limitations and hence compressing all files that we store becomes imperative.
Part 1: How to write a cost-effective database for chess from scratch
Part 2: Compressing a PGN File
Part 3: Compressing FEN with Moves and Game Details
This blog post explains the Compression of PGN file. This algorithm gives almost 50% compression ratio. If you want to dive directly into code then click here.
Before we begin we need to understand two things
- Format of a PGN file
- Items in the PGN file
Let's discuss this one by one
Format of a PGN File
Below is the format of a standard PGN file
key and values are metadata of the game like event details, players' details, etc
Items in the PGN file
Below are the items in PGN file
- Game's Metadata
- Moves in SAN (short algebraic notation) format
- Move number
- Result of the game
This compression algorithm compresses only moves, moves numbers, and the result of the game and leaves the metadata as it is, If you want you may choose to compress those as well to achieve an even better compression ratio.
But before compression begins we need to parse a PGN file. I am not explaining the parsing of PGN files in this post, if you want you can check the code to parse PGN file in java
To compress a move we need to understand how a move is written SAN is structured. A SAN contains the below informations in the given order
- The piece which is moving (not needed for pawns) or Castle
- Rank or File of the original square (Must for pawn and only needed for other pieces if two or more pieces of the same type can reach the move's destination square)
- Does the move captures the opponent's piece or not (Needed only when there's a capture)
- Destination square (Mandatory)
- A promotion symbol if the move promotes a pawn to another piece (Needed only when the move promotes a pawn to any piece)
- Promoted piece if the piece is promoted (Needed only when move promotes a pawn to any piece)
- A symbol denoting if it's a Check or Mate (Needed only when there's a check or Mate)
As you can see much of the above-mentioned informations are optional. Each of the above-mentioned 7 informations are denoted using a symbol. In a plain PGN file, these symbols are written in Characters which takes 1 byte each but since these symbols are limited we can denote them in bits as per their number of possibilities.
For example, the result has only 3 possibilities 1-0,0-1, or ½-½. So we can write the result in just 3 bits but as we can only write a full byte, we will pad the rest of the 5 bits and any result can be written using just 1 byte instead of 3 bytes for black or white's win and 7 bytes for a draw. It itself saves more than 50% of space.
This is basically a type of encoding for SAN. Let's design the encoding for each part
Encoding a chess Piece:
There are six kinds of pieces in chess:
- Pawn
- Rook
- Knight
- Bishop
- Queen
- King
Since can understand the color of the piece from the move order, we do not need to store color information of the piece and 6 values can be written in 3 bits. So piece will take only 3 bits
- Pawn - 000
- Rook - 001
- Knight - 010
- Bishop - 011
- Queen - 100
- King - 101
Encoding Rank or File of the chessboard
The rank of chess board is denoted by the number of rows which goes from 1-8 and the file is denoted by an alphabet assigned to each column that goes from a-h so in total there are 16 possibilities and we can write 16 numbers in 4 bits
- 1- 0000
- 2 - 0001
- 8 - 1000
- a - 1001
- b - 1010
- h - 1111
So each file or rank will take 4 bits and hence starting position will take only 4 bits.
Capture or No Capture
One bit, If it's true, then it's a move that captures another piece otherwise not.
Destination square
Destination squares can be any one among 64 squares, 64 can be written using 6 bits (2^6 = 64). we Can also split these 6 bits into two parts, 3 for file rest 3 for rank as both are of just 8 types and 8 can be written in 3 bits.
Promotion, Check, and Mate
These are special operations that may appear in a move. These are of just 3 types but combinations are possible, as a move can deliver check with promotion so there are in total 5 types and will take 3 bits.
In programming terms below are properties
Above mentioned properties and space will be applicable when the move is in the memory. While writing it into a file, we will have to consider that we can only write a full byte, which is why we will define a lookup that'll have a byte corresponding to all kinds of tokens that we will have in a move.
What Can be considered as a token in a move
Let's take a move eg: Nxe6+
Here tokens are:
- Nx (Knight Takes, it would be only N if there's no capture, meaning in the case of Ne6+)
- e6 (Target square)
- + (Operation)
Take another example: axb8=Q+ (a takes on b8 with promotion to the queen and delivering a check to the king)
Here tokens are:
- ax (a takes, the moving piece is a pawn since no other piece is mentioned in the move)
- b8 (Target Square)
- + (Operation)
- = (Operation)
- Q (promoted piece)
For all such tokens, we will define a byte and write that into file. QAnd while decoding we will use the same lookup table to bring back the usual SAN. Below is a function that prepares that encoding.
Hence Nxe6+ which is 5 bytes when written as String and 3 bytes when encoded, similarly, axb8=Q+ will take 7 bytes when written as String but when encoded it will take 5 bytes.
Let's take a look at the preparation of this lookup
In a concise manner that's the algorithm to encode a move in such a way that it takes lesser space. You can take a look at the full source code of a TinyMove here
Further Scope of Compression:
There is definitely, further scope for compression. for example, we can define an order and write those bits in that order using padding bits to create full bytes.
for example:
exf6+ should take memory as given below
P => 4 bits (piece)
e => 4 bits (Starting square)
x => 1 bit ( Capture)
f => 3 bits ( target File)
6 => 3 bits ( target Rank)
+ => 3 bits (Operation)
Total: 18 Bits, minimum bits required for full bytes would be 24, hence we will need to pad these 18 bits with 6 bits and the final encoded value would take 24 bits (3 bytes) which is the same as our algorithm but for more complex moves it might take lesser space.
Constructing a Fully encoded PGN file
You can define a PGN file as a class that has the below properties
- Map for metadata
- Result
- List of moves
Moves can be written into encoded moves and the rest of the metadata can be written as a plain string and at the end result can be written using the very last byte of the file.
Take a look at the full code for compressing and decompressing the entire PGN file.
Leave a Comment: