GPGPU/FPGA programming for Combinatorial Analysis

Question

Recently, I have taken an interest in performing combinatorial analysis for the game of 21 (blackjack) and attempted to use my AMD APU to try and thread the program via the 4 cores on the chip. Generating the proper strategy and expectations has proven to be a success, as well as taking very few seconds (around 12) to compute the optimal strategy for any given rule(s) and deck composition. However, when it comes to computing other deck compositions, it would take 12.63 years to compute all deck states...for a single deck. So, another approach would be to simulate a select limit of random deck subsets via a Monte-Carlo method. This will work! However, the 12 seconds per analysis is a huge bottleneck! Randomly selecting just 10E6 subsets would take around 139 days. Not feasible with my current CPU, even with threading.

A primer on how I am analysing the strategy chart(s):

1.) I first start by computing what the overall expectation for standing is by taking a set of n_cards for the player and enumerating all possible dealer hand sets. I compare the outcome of each specific dealer total against the current hand total of the player hand.

2.) I then compute the hitting expectation by taking a player hand set {23} and finding matching hand sets for x:2->A, so that we can compute the weighed expectation of hitting to {232}, {233}, {234}, ...,{23A}. We do this by computing all hard 21's first, then hard 20's...down to all hard 12's; then do the same by computing hitting all soft 21's to soft 13's; then compute all hard 11's to hard 5; and finally, we compute hitting all pair cards ({22}, {33}, ..., {TT}, {AA}.)

3.) Doubling is the same as hitting, except we only draw one card and stand for any 2_card player hand. We return the expectation of drawing to a specific 3_card hand and multiply the expectation by 2 times the probability of that drawn card.

4.) For splitting, we will take steps 1.) to 3.), we will repeat 1->3 up to 11 times. Why? We are computing the expectation of splitting to a new hand by removing the rank for which we are splitting. (For example: We want to split {22}. We will compute standing, hitting, doubling for a given deck subset, minus one of the ranks we are splitting. If we originally split {22} for a deck subset of {4 4 4 4 4 4 4 4 16 4: 2-A} or {3 1 4 3 4 4 3 1 14 3: 2-A}, our new subsets would be {3 4 4 4 4 4 4 4 16 4: 2-A} or {2 1 4 3 4 4 3 1 14 3: 2-A}.)

5.) After all of this, we should get back out optimal strategy with respect to the computed expectation(s) for each possible decision for each player hand.

Now, I have been looking into seeing if there is a quicker way of analysing the strategy using either GPGPU or FPGA/ASIC. I first looked at GPU programming as a possible route, however, I am not sure if it is feasible as there are 3072 unique player hand states, meaning there will be 3072 unique data structures to compute for standing, hitting, doubling, and splitting.

With all this said, I was hoping there was a way to perform some form of simultaneous processing, where the state of each expectation changes dependent on the change in the rules and/or the deck state. That is, rather than recompute each expectation for every change in rules/deck_state, there would be some dependence on the output for each data point. For example: We have three data points named A, B, and C. We also have state data {1 2 3}. For each data point, there is a state equation that we possess:

A = data[0]

B = A + data[1]

C = B +data[2]

Under serial processing, we would evaluate A first for 1, B for 2 plus A, and C for 3 plus B. If we change data to be {2 2 3}, then A would be 2, B would be 4, and C would be 7. Under serial processing, this would take several cycles per data point, for each data point. Is there a method under FPGA where the change in state is instantaneous? That is if we change any point of data, that A, B, and C will change after data changes, simultaneously?

If not, what benefits does FPGA/GPGPU processing offer to a programmer who wants to accurately and quickly compute large volumes of floating point data in a near instant? Basically, is there a way to rapidly compute the optimal strategy of a blackjack game that is faster than multi-threaded CPU processing that is not 12 seconds long?

ajit · Answer

I am writing a general answer about porting a program running on a CPU to a GPU or FPGA.
Both GPU programs (using say CUDA) and CPU programs are written in high level languages like C, C++. Therefore it is much easier to port a CPU program to its equivalent on a GPGPU. The algorithm that you have presented seems suitable for porting to GPU. It is compute intensive, repetitive. Some effort, and you will be able to achieve good amount of speedup.
Framework like OpenCL, OneAPI (Intel) primary objective is to automate the process of porting a CPU program to accelerators like  GPU or FPGA. But they still have not achieved wide acceptance. Automation is still not good enough across problem domains.  In my personal experience it is much easier to write a GPU program from scratch. For somebody who can program in C it is not very difficult.
FPGA is a different game altogether. Like for the logic that you mention in your example using three data points, can be easily implemented in an FPGA using two Adders. The resultant circuit is a combinational circuit, hence using appropriate adder (like carry look-ahead) the computation can be performed within a nano-second on modern FPGA.
But that would require learning a HDL language like verilog or VHDL. It will require a lot of effort.
Alternative is to just wait for some of the framework like OneAPI or OpenCL to get refined enough that they are able to generate an equivalent FPGA HDL code from a high level language(like in C) program.
Personally, I believe, it is not going to be anytime soon.

GPGPU/FPGA programming for Combinatorial Analysis

One Answer

Add your own answers!

Ask a Question