# GPGPU/FPGA programming for Combinatorial Analysis

Recently, I have taken an interest in performing combinatorial analysis for the game of 21 (blackjack), and I attempted to use my AMD APU to thread the program across the chip's 4 cores. Generating the proper strategy and expectations has been a success, and it takes only about 12 seconds to compute the optimal strategy for any given rules and deck composition. However, exhaustively computing every other deck composition would take 12.63 years…for a single deck. So, another approach is to simulate a limited number of random deck subsets via a Monte Carlo method. This will work! However, the 12 seconds per analysis is a huge bottleneck: randomly selecting just 10^6 subsets would take around 139 days. Not feasible with my current CPU, even with threading.
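The quoted timings can be sanity-checked with a quick back-of-the-envelope calculation. This is a minimal sketch under one assumption not stated explicitly above: that "all deck states" counts the rank compositions of a single 52-card deck, (4+1)^9 × (16+1), which does reproduce the 12.63-year figure.

```python
# Back-of-the-envelope check of the timings quoted above.
# Assumption: "all deck states" = rank compositions of one deck,
# i.e. 0-4 of each of the nine ranks 2-9 and A, and 0-16 tens.
seconds_per_analysis = 12
single_deck_states = 5**9 * 17                        # 33,203,125 subsets
years = single_deck_states * seconds_per_analysis / (3600 * 24 * 365)
days_for_1e6 = 10**6 * seconds_per_analysis / (3600 * 24)
print(round(years, 2), round(days_for_1e6))           # 12.63, 139
```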

A primer on how I am analysing the strategy chart(s):

1.) I first compute the overall expectation for standing by taking a set of n_cards for the player and enumerating all possible dealer hand sets. I compare the outcome of each specific dealer total against the current total of the player hand.

2.) I then compute the hitting expectation by taking a player hand set {23} and finding matching hand sets for x:2->A, so that we can compute the weighted expectation of hitting to {232}, {233}, {234}, …, {23A}. We do this by computing all hard 21’s first, then hard 20’s…down to all hard 12’s; then do the same by computing hitting all soft 21’s down to soft 13’s; then compute all hard 11’s down to hard 5’s; and finally, we compute hitting all pair hands ({22}, {33}, …, {TT}, {AA}).

3.) Doubling is the same as hitting, except we draw only one card and then stand, for any 2_card player hand. We return the expectation of drawing to a specific 3_card hand and multiply that expectation by 2 times the probability of the drawn card.

4.) For splitting, we take steps 1.) to 3.) and repeat them up to 11 times. Why? We are computing the expectation of splitting to a new hand by removing one card of the rank we are splitting. (For example: we want to split {22}. We compute standing, hitting, and doubling for the given deck subset, minus one of the cards of the split rank. If we originally split {22} with a deck subset of {4 4 4 4 4 4 4 4 16 4: 2-A} or {3 1 4 3 4 4 3 1 14 3: 2-A}, our new subsets would be {3 4 4 4 4 4 4 4 16 4: 2-A} or {2 1 4 3 4 4 3 1 14 3: 2-A}.)

5.) After all of this, we should get back our optimal strategy with respect to the computed expectation(s) for each possible decision for each player hand.
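The steps above can be sketched in code. This is a minimal illustration of steps 1.) and 2.) only, under simplifying assumptions not in the original: an infinite deck (draws with replacement, so the composition-dependence that motivates the question is deliberately dropped), the dealer stands on all 17s, and naturals, doubling, and splitting are ignored.

```python
from functools import lru_cache

RANKS = range(2, 12)                                  # 2-9, T(10), A(11)
PROB = {r: (4 / 13 if r == 10 else 1 / 13) for r in RANKS}

def add_card(total, soft, r):
    """Add one card to a hand total, tracking whether an ace counts as 11."""
    if r == 11:                                       # ace: 11 if it fits, else 1
        return (total + 11, True) if total + 11 <= 21 else (total + 1, soft)
    total += r
    if total > 21 and soft:                           # demote a soft ace
        return total - 10, False
    return total, soft

@lru_cache(maxsize=None)
def dealer_dist(total, soft):
    """Distribution of dealer final totals; 22 stands for any bust."""
    if total > 21:
        return ((22, 1.0),)
    if total >= 17:
        return ((total, 1.0),)
    out = {}
    for r, p in PROB.items():
        for t, q in dealer_dist(*add_card(total, soft, r)):
            out[t] = out.get(t, 0.0) + p * q
    return tuple(out.items())

def stand_ev(player_total, up):
    """Step 1: expectation of standing (+1 win, 0 push, -1 loss)."""
    ev = 0.0
    for t, p in dealer_dist(up, up == 11):
        if t == 22 or t < player_total:
            ev += p
        elif t > player_total:
            ev -= p
    return ev

@lru_cache(maxsize=None)
def best_ev(total, soft, up):
    """Step 2: max of standing and hitting; the memoised recursion resolves
    high totals first, mirroring the 21-down-to-5 order described above."""
    hit = 0.0
    for r, p in PROB.items():
        t2, s2 = add_card(total, soft, r)
        hit += p * (-1.0 if t2 > 21 else best_ev(t2, s2, up))
    return max(stand_ev(total, up), hit)
```

For example, `best_ev(16, False, 10)` returns the better of standing or hitting hard 16 against a ten. Making each expectation depend on the live deck composition — replacing `PROB` with counts that are decremented as cards are drawn — is exactly what blows the state space up to the figures quoted above.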

Now, I have been looking into whether there is a quicker way of analysing the strategy using either GPGPU or FPGA/ASIC. I first looked at GPU programming as a possible route; however, I am not sure it is feasible, as there are 3072 unique player hand states, meaning there would be 3072 unique data structures to compute for standing, hitting, doubling, and splitting.

With all this said, I was hoping there was a way to perform some form of simultaneous processing, where the state of each expectation changes depending on the change in the rules and/or the deck state. That is, rather than recomputing each expectation for every change in rules/deck_state, there would be some dependence on the output of each data point. For example: we have three data points named A, B, and C, and state data {1 2 3}. For each data point, there is a state equation that we possess:

A = data[0]

B = A + data[1]

C = B + data[2]

Under serial processing, we would evaluate A first as 1, then B as 2 plus A, and C as 3 plus B. If we change data to {2 2 3}, then A would be 2, B would be 4, and C would be 7. Under serial processing, this would take several cycles per data point, for each data point. Is there a method on an FPGA where the change in state is instantaneous? That is, if we change any point of data, will A, B, and C all change simultaneously as soon as data changes?
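A minimal sketch of the serial evaluation of these three state equations, for reference:

```python
def evaluate(data):
    """Evaluate the three state equations serially: each output
    depends on the previous one, so updates propagate in order."""
    A = data[0]
    B = A + data[1]
    C = B + data[2]
    return A, B, C

print(evaluate([1, 2, 3]))   # (1, 3, 6)
print(evaluate([2, 2, 3]))   # (2, 4, 7)
```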

If not, what benefits does FPGA/GPGPU processing offer a programmer who wants to accurately and quickly compute large volumes of floating-point data nearly instantly? Basically, is there a way to compute the optimal strategy of a blackjack game that is faster than 12 seconds of multi-threaded CPU processing?

Computational Science Asked by dogman_1234 on December 31, 2020

I am writing a general answer about porting a program running on a CPU to a GPU or FPGA.

Both GPU programs (using, say, CUDA) and CPU programs are written in high-level languages like C and C++. Therefore it is much easier to port a CPU program to its GPGPU equivalent. The algorithm you have presented seems suitable for porting to a GPU: it is compute-intensive and repetitive. With some effort, you should be able to achieve a good amount of speedup.

The primary objective of frameworks like OpenCL and OneAPI (Intel) is to automate the process of porting a CPU program to accelerators like GPUs or FPGAs. But they still have not achieved wide acceptance; the automation is still not good enough across problem domains. In my personal experience it is much easier to write a GPU program from scratch. For somebody who can program in C, it is not very difficult.

An FPGA is a different game altogether. The logic you mention in your example using three data points, for instance, can easily be implemented in an FPGA using two adders. The resulting circuit is a combinational circuit, so with an appropriate adder design (such as carry-lookahead) the computation can be performed within a nanosecond on a modern FPGA.

But that would require learning an HDL such as Verilog or VHDL, which takes a lot of effort.

The alternative is to wait for frameworks like OneAPI or OpenCL to become refined enough to generate equivalent FPGA HDL code from a high-level-language (e.g., C) program.

Personally, I believe that is not going to happen anytime soon.

Answered by ajit on December 31, 2020
