Compressing a PGN file| Scriptchess

A chess game is exported in the form of a PGN file. This blog post explains the algorithm that helps in compressing a PGN file so that you can store it in lesser space. As explained in the first part of this blog post series “How to write a cost-effective database for chess from scratch”, we have to consider space limitations and hence compressing all files that we store becomes imperative.

This blog post explains the Compression of PGN file. This algorithm gives almost 50% compression ratio. If you want to dive directly into code then click here.

Before we begin we need to understand two things

Format of a PGN file
Items in the PGN file

Let's discuss this one by one

Format of a PGN File

Below is the format of a standard PGN file

[key value]

[key value]

[key value]



1.e4 e5 2. Nf3 Nc6 ….. 1/0

key and values are metadata of the game like event details, players' details, etc

Items in the PGN file

Below are the items in PGN file

Game's Metadata
Moves in SAN (short algebraic notation) format
Move number
Result of the game

This compression algorithm compresses only moves, moves numbers, and the result of the game and leaves the metadata as it is, If you want you may choose to compress those as well to achieve an even better compression ratio.

But before compression begins we need to parse a PGN file. I am not explaining the parsing of PGN files in this post, if you want you can check the code to parse PGN file in java

To compress a move we need to understand how a move is written SAN is structured. A SAN contains the below informations in the given order

The piece which is moving (not needed for pawns) or Castle
Rank or File of the original square (Must for pawn and only needed for other pieces if two or more pieces of the same type can reach the move's destination square)
Does the move captures the opponent's piece or not (Needed only when there's a capture)
Destination square (Mandatory)
A promotion symbol if the move promotes a pawn to another piece (Needed only when the move promotes a pawn to any piece)
Promoted piece if the piece is promoted (Needed only when move promotes a pawn to any piece)
A symbol denoting if it's a Check or Mate (Needed only when there's a check or Mate)

As you can see much of the above-mentioned informations are optional. Each of the above-mentioned 7 informations are denoted using a symbol. In a plain PGN file, these symbols are written in Characters which takes 1 byte each but since these symbols are limited we can denote them in bits as per their number of possibilities.

For example, the result has only 3 possibilities 1-0,0-1, or ½-½. So we can write the result in just 3 bits but as we can only write a full byte, we will pad the rest of the 5 bits and any result can be written using just 1 byte instead of 3 bytes for black or white's win and 7 bytes for a draw. It itself saves more than 50% of space.

This is basically a type of encoding for SAN. Let's design the encoding for each part

Encoding a chess Piece:

There are six kinds of pieces in chess:

Pawn
Rook
Knight
Bishop
Queen
King

Since can understand the color of the piece from the move order, we do not need to store color information of the piece and 6 values can be written in 3 bits. So piece will take only 3 bits

Pawn - 000
Rook - 001
Knight - 010
Bishop - 011
Queen - 100
King - 101

Encoding Rank or File of the chessboard

The rank of chess board is denoted by the number of rows which goes from 1-8 and the file is denoted by an alphabet assigned to each column that goes from a-h so in total there are 16 possibilities and we can write 16 numbers in 4 bits

1- 0000
2 - 0001
8 - 1000
a - 1001
b - 1010
h - 1111

So each file or rank will take 4 bits and hence starting position will take only 4 bits.

Capture or No Capture

One bit, If it's true, then it's a move that captures another piece otherwise not.

Destination square

Destination squares can be any one among 64 squares, 64 can be written using 6 bits (2^6 = 64). we Can also split these 6 bits into two parts, 3 for file rest 3 for rank as both are of just 8 types and 8 can be written in 3 bits.

Promotion, Check, and Mate

These are special operations that may appear in a move. These are of just 3 types but combinations are possible, as a move can deliver check with promotion so there are in total 5 types and will take 3 bits.

In programming terms below are properties

    private boolean castle;
    private boolean hasStartingPosition;
    private boolean captures;
    private boolean[] startPosition = null; //4 bit
    private boolean[] targetFile = null; // 3bit
    private boolean[] targetRow = null; //3 bit
    private boolean[] operation = null; //3 bit
    private boolean[] piece = null; //3 bit
    private boolean[] promotedPiece = null; //3 bit

Above mentioned properties and space will be applicable when the move is in the memory. While writing it into a file, we will have to consider that we can only write a full byte, which is why we will define a lookup that'll have a byte corresponding to all kinds of tokens that we will have in a move.

What Can be considered as a token in a move

Let's take a move eg: Nxe6+

Here tokens are:

Nx (Knight Takes, it would be only N if there's no capture, meaning in the case of Ne6+)
e6 (Target square)
+ (Operation)

Take another example: axb8=Q+ (a takes on b8 with promotion to the queen and delivering a check to the king)

Here tokens are:

ax (a takes, the moving piece is a pawn since no other piece is mentioned in the move)
b8 (Target Square)
+ (Operation)
= (Operation)
Q (promoted piece)

For all such tokens, we will define a byte and write that into file. QAnd while decoding we will use the same lookup table to bring back the usual SAN. Below is a function that prepares that encoding.

Hence Nxe6+ which is 5 bytes when written as String and 3 bytes when encoded, similarly, axb8=Q+ will take 7 bytes when written as String but when encoded it will take 5 bytes.

Let's take a look at the preparation of this lookup

private static void prepareMOveEncodingMap() {
        String[] files = {"a","b","c","d","e","f","g","h"};
        String[] ranks = {"1","2","3","4","5","6","7","8"};
        String[] pieces = {"R","N","B","Q","K"};
        String[] operation = {"+","=","#"};
        byte counter = Byte.MIN_VALUE;

        //castle moves
        movePartToNativeEncodingMap.put(O_O, counter); counter++;
        movePartToNativeEncodingMap.put(O_O1, counter); counter++;
        movePartToNativeEncodingMap.put(O_O2, counter); counter++;
        movePartToNativeEncodingMap.put(O_O_O, counter); counter++;
        movePartToNativeEncodingMap.put(O_O_O1,counter); counter++;
        movePartToNativeEncodingMap.put( O_O_O2, counter); counter++;


        //prepare simple pawn moves eg : e4,
        for (int i = 0; i < files.length; i++) {
            for (int j = 0; j < ranks.length; j++) {
                movePartToNativeEncodingMap.put(files[i]+ranks[j], counter);
                counter++;
            }
        }

        //prepare pawn capture moves eg: ex of exd4
        for (int i = 0; i < files.length; i++) {
            movePartToNativeEncodingMap.put(files[i]+"x", counter);
            counter++;
        }

        //prepare piece capture moves eg: Nx of Nxg6
        for (int i = 0; i < pieces.length; i++) {
            movePartToNativeEncodingMap.put(pieces[i]+"x", counter);
            counter++;
        }

        //prepare piece with rank position eg. N8 of N8g6
        for (int i = 0; i < pieces.length; i++) {
            for (int j = 0; j < ranks.length; j++) {
                movePartToNativeEncodingMap.put(pieces[i]+ranks[j], counter);
                counter++;
            }
        }

        //prepare piece with file position eg. Nh of Nhg6
        for (int i = 0; i < pieces.length; i++) {
            for (int j = 0; j < files.length; j++) {
                movePartToNativeEncodingMap.put(pieces[i]+files[j], counter);
                counter++;
            }
        }

        //prepare piece capture moves of piece with file eg: Nhx of Nhxg6
        for (int i = 0; i < pieces.length; i++) {
            for (int j = 0; j < files.length; j++) {
                movePartToNativeEncodingMap.put(pieces[i]+files[j]+"x", counter);
                counter++;
            }
        }

        //prepare piece with rank position with capture eg. N8x of N8xg6
        for (int i = 0; i < pieces.length; i++) {
            for (int j = 0; j < ranks.length; j++) {
                movePartToNativeEncodingMap.put(pieces[i]+ranks[j]+"x", counter);
                counter++;
            }
        }

        //prepare operations
        for (int i = 0; i < operation.length; i++) {
            movePartToNativeEncodingMap.put(operation[i],counter);
            counter++;
        }

        //prepare pieces
        for (int i = 0; i < pieces.length; i++) {
            movePartToNativeEncodingMap.put(pieces[i], counter);
            counter++;
        }
        System.out.println("Total Size:" + movePartToNativeEncodingMap.size());
        List nums = new ArrayList<>();
        for(String key : movePartToNativeEncodingMap.keySet()) {
            nums.add(movePartToNativeEncodingMap.get(key));
        }
        Collections.sort(nums);
        for (byte i = -128; i < 127; i++) {
            if(!nums.contains(i) && i != DETAIL_MOVE_SEPARATOR && i != GAME_SEPARATOR) {
                System.out.println(i);
            }
        }
        for(Map.Entry entry : movePartToNativeEncodingMap.entrySet()) {
            //byte[] bytes = new byte[1];
            //bytes[0] = entry.getValue();
            //System.out.println(entry.getKey()+":"+entry.getValue()+":" + new String(bytes, StandardCharsets.UTF_8));
            nativeToMovePartDecodingMap.put(entry.getValue(),entry.getKey());
        }
    }

In a concise manner that's the algorithm to encode a move in such a way that it takes lesser space. You can take a look at the full source code of a TinyMove here

Further Scope of Compression:

There is definitely, further scope for compression. for example, we can define an order and write those bits in that order using padding bits to create full bytes.

for example:

exf6+ should take memory as given below

P => 4 bits (piece)

e => 4 bits (Starting square)

x => 1 bit ( Capture)

f => 3 bits ( target File)

6 => 3 bits ( target Rank)

+ => 3 bits (Operation)

Total: 18 Bits, minimum bits required for full bytes would be 24, hence we will need to pad these 18 bits with 6 bits and the final encoded value would take 24 bits (3 bytes) which is the same as our algorithm but for more complex moves it might take lesser space.

Constructing a Fully encoded PGN file

You can define a PGN file as a class that has the below properties

Map for metadata
Result
List of moves

Moves can be written into encoded moves and the rest of the metadata can be written as a plain string and at the end result can be written using the very last byte of the file.

Take a look at the full code for compressing and decompressing the entire PGN file.

Comments (6)

Jeff Lowery (fensterchess.com)

2024-01-29

Thank you for the thoughtful article.

Steffen

2024-03-19

for you assumption that 256 would be enough to count moves I'd like you to have a look at this game: https://www.chessgames.com/perl/chessgame?gid=1268705

casanovanano3

2024-10-05

nanika1234567

Compressing a PGN file

Navigation Table

Format of a PGN File

Items in the PGN file

Encoding a chess Piece:

Encoding Rank or File of the chessboard

Capture or No Capture

Destination square

Promotion, Check, and Mate

What Can be considered as a token in a move

Further Scope of Compression:

Constructing a Fully encoded PGN file

About Author

Kumar Gaurav

Leave a Comment:

Comments (6)

Jeff Lowery (fensterchess.com)

Jeff Lowery (fensterchess.com)

Steffen

casanovanano3

casanovanano3

casanovanano3