This article follows up on a previous article on semi-supervised learning. I have implemented graph-based semi-supervised learning from scratch in CUDA C++. A graph is a data structure that can represent the relationships among data instances; here we exploit those mutual relationships to infer the labels.

**Graph-based Semi-Supervised learning (GSSL)**

GSSL consists of two main steps: graph construction followed by label propagation. The label with the highest confidence is taken as the instance's label. Consider Fig. 1 below: there are 6 data instances, of which only three labels are known, denoted in their respective class colors. Edge boldness denotes the similarity between instances; I have omitted edges between instances that are far apart because they have little impact on label prediction. As shown in Fig. 2, label propagation yields a class-wise label probability for each data instance, from which the labels are assigned.

**Algorithm**

Let there be n data instances denoted x_1 to x_n and c classes numbered 0 to c-1. W (the similarity matrix) is an n×n matrix, and Y (the initial label matrix) is an n×c matrix where Y_ij = 1 if the i-th instance belongs to the j-th class and Y_ij = 0 otherwise.

1. Calculate the pairwise weight matrix $W$, where

   $$W_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \text{ for } i \neq j, \qquad W_{ii} = 0$$

2. Normalise the matrix $W$ to get $S$,

   $$S = D^{-1/2} W D^{-1/2}$$

   in which $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$.

3. Calculate the soft label matrix $F$,

   $$F = (I - \alpha S)^{-1} Y$$

   where $Y$ is the initial label matrix and $\alpha = 0.9$.

4. Find the final label for the $i$-th instance: $y_i = \arg\max_j F_{ij}$.
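The four steps above can be sketched end-to-end in plain NumPy. This is a dense, CPU-only reference (not the article's CUDA code), assuming the Gaussian kernel above with a hand-picked σ:

```python
import numpy as np

def gssl_numpy(X, Y, alpha=0.9, sigma=1.0):
    """Dense NumPy sketch of the closed-form GSSL solution.

    X: (n, d) features; Y: (n, c) initial one-hot labels,
    with all-zero rows for unlabelled instances.
    """
    # step 1: pairwise Gaussian weights, zero diagonal
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # step 2: symmetric normalisation S = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # step 3: soft labels F = (I - alpha * S)^{-1} Y
    F = np.linalg.solve(np.eye(len(X)) - alpha * S, Y)
    # step 4: hard labels per instance
    return F.argmax(axis=1)
```

On two well-separated clusters with one labelled point each, the propagation assigns every point to its cluster's class, mirroring what the CUDA implementation below does at scale.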

**Implementation**

The entire algorithm is modularised into a class CudaGSSL which can be called from Python. Bindings are created using pybind11 (explained in my older article). I chose CUDA because most of the steps are parallelizable and can be executed faster than on a CPU. In this section, I will give an overview of each file. The full implementation is available at https://github.com/aromalma/cuda-gssl.git.

**gssl.cu**: This file contains the class CudaGSSL and its bindings. The class object is constructed from a 2D NumPy array (data instances of shape n×d). Inside the kernels, this 2D array is treated as a 1D array. CUDA kernels cannot return arrays, so results are passed back through pointers. When launching a kernel, a sufficient number of threads should be allocated by configuring the block and grid dimensions; what each thread executes, and where, is controlled by its thread ID. Global memory is allocated and passed to the kernel for execution; each kernel manipulates this global memory to produce its result. The kernel weight_matrix_calc is launched with <<<grids_dim,block_dim>>> inside the gen_w function. Depending on the operation, thread dimensions can be 1, 2, or 3. To get a basic understanding of thread organization, check this link: https://www.tutorialspoint.com/cuda/cuda_keywords_and_thread_organization.htm
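The block/grid sizing used throughout boils down to a ceiling division, so that the total thread count is at least the problem size. A small Python sketch (illustrative only; the helper name is mine, not from the repo):

```python
def grid_dim(n, block):
    # ceiling division: the smallest number of blocks whose
    # combined threads cover all n elements
    return (n + block - 1) // block

# a 2D launch for an n x n weight matrix with 32 x 32 blocks
n, BLK = 150, 32
grid = (grid_dim(n, BLK), grid_dim(n, BLK))
print(grid)  # (5, 5) -> 5 * 32 = 160 >= 150 threads per axis
```

Threads whose index lands beyond n (here 160 − 150 = 10 per axis) simply do nothing, which is why the kernels guard their work with `if (i < n ...)` checks.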

```cpp
//.............................code snippet..................................//
class CudaGSSL {
public:
    double *a, *c_a;
    double *w, *c_w, *c_wi;
    int n = 1;
    std::vector<ssize_t> shape;
    double *c_te, *te;

    CudaGSSL(py::array_t<double> &arr) {
        auto buf = arr.request();
        a = (double *)buf.ptr;
        shape = buf.shape;
        for (auto s : shape) { n *= s; }
        w = new double[shape[0] * shape[0]];
        te = new double[shape[0] + 1];
        // allocate memory for data instance features
        cudaMalloc(&c_a, n * sizeof(double));
        // allocate memory for the W matrix
        cudaMalloc(&c_w, shape[0] * shape[0] * sizeof(double));
        // allocate memory for the inverse matrix
        cudaMalloc(&c_wi, shape[0] * shape[0] * sizeof(double));
        for (int k = 0; k < shape[0] + 1; k++) { te[k] = 0.0; }
        // temporary global buffer for saving intermediate results
        cudaMalloc(&c_te, (shape[0] + 1) * sizeof(double));
        // copy elements from host to device
        cudaMemcpy(c_te, te, (shape[0] + 1) * sizeof(double), cudaMemcpyHostToDevice);
    }

    ~CudaGSSL() {
        // free memory after use
        delete[] w;
        delete[] te;
        cudaFree(c_a);
        cudaFree(c_wi);
        cudaFree(c_w);
        cudaFree(c_te);
    }
//....................................................................//
```

```cpp
//.............................code snippet..................................//
void gen_w() {
    cudaMemcpy(c_a, a, n * sizeof(double), cudaMemcpyHostToDevice);
    // BLK is defined in the header as 32
    dim3 block_dim(BLK, BLK);
    // for 1D kernels
    int threadsPerBlock = 256;
    // the total thread count should be sufficient to cover every element
    dim3 grids_dim((shape[0] + block_dim.x - 1) / block_dim.x,
                   (max(shape[0], shape[1]) + block_dim.y - 1) / block_dim.y);
    int blocksPerGrid = (shape[0] + threadsPerBlock - 1) / threadsPerBlock;
    weight_matrix_calc<<<grids_dim, block_dim>>>(c_a, c_w, shape[0], shape[1]);
    // synchronise all threads up to here
    cudaDeviceSynchronize();
    // get the mean of the elements using parallel reduction
    mean<<<grids_dim, block_dim>>>(c_w, c_te, shape[0]);
    cudaDeviceSynchronize();
}
//......................................................................//
```

The function get_w below returns the W matrix to the Python side. Before returning, the result must be copied from the device back to the host (CPU).

```cpp
// fn for returning W to the Python side
py::array_t<double> get_w() {
    cudaMemcpy(w, c_w, shape[0] * shape[0] * sizeof(double), cudaMemcpyDeviceToHost);
    py::array_t<double> numpy_array({shape[0], shape[0]}, w);
    return numpy_array;
}
```

**kernel.cu:** This file consists of the different CUDA kernels; here I will explain two of them. weight_matrix_calc calculates the pairwise squared Euclidean distances and updates the W matrix, with the diagonal elements set to 0. Which threads execute what is controlled using if statements on their thread IDs.

```cpp
#include "kernels.h"
#include <math.h>

__global__ void weight_matrix_calc(double *a, double *c, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k;
    // set the diagonal to zero
    if (i < n && j == 0) c[i * n + i] = 0.0;
    // find pairwise distances; only the upper triangle is computed (symmetric)
    if (i < n && j < n && j > i) {
        double s = 0;
        for (k = 0; k < d; k++) {
            s += pow((a[i * d + k] - a[j * d + k]), 2);
        }
        c[i * n + j] = s;
        c[i + n * j] = s;
    }
}

// parallel reduction sum and mean
__global__ void mean(double *c, double *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int tid = BLK * threadIdx.x + threadIdx.y;
    // shared buffer for summing elements
    __shared__ double temp[BLK * BLK];
    int stride = BLK * BLK / 2;
    // copy elements to shared memory; out-of-range threads contribute zero
    if (i < n && j < n) temp[tid] = c[i * n + j];
    else temp[tid] = 0.0;
    // block-level synchronization
    __syncthreads();
    // tree reduction: every thread of the block takes part in the loop,
    // so __syncthreads() is never reached divergently
    while (stride > 0) {
        if (tid < stride) {
            temp[tid] += temp[tid + stride];
        }
        // block-level synchronization
        __syncthreads();
        stride /= 2;
    }
    // sum up temp[0] of every block into d[0] without a race condition
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        atomicAdd(&d[0], temp[0] / (n * n - n));
    }
}
```
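On the host side, both kernels can be sanity-checked against a NumPy reference. The sketch below (hypothetical helpers, not part of the repo) reproduces the squared-distance matrix with a zero diagonal and the mean over the n²−n off-diagonal elements:

```python
import numpy as np

def weight_matrix_ref(X):
    """Host reference for weight_matrix_calc: pairwise squared
    Euclidean distances with a zero diagonal."""
    diff = X[:, None, :] - X[None, :, :]
    W = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(W, 0.0)  # already zero by construction; kept explicit
    return W

def offdiag_mean_ref(W):
    """Host reference for the mean kernel: the sum of all entries
    divided by n*n - n, the number of off-diagonal elements."""
    n = len(W)
    return W.sum() / (n * n - n)

X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
W = weight_matrix_ref(X)   # W[0,1] = 3^2 + 4^2 = 25.0
print(offdiag_mean_ref(W))
```

Comparing these against the arrays returned through get_w is a quick way to catch indexing mistakes in the kernels.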

Here the kernel "mean" calculates the mean of W using parallel reduction. The entire kernel operation can be understood from Fig 2. Shared memory is not accessible outside its block, so after the per-block sum reduction, each block's temp[0] is added into d[0]. "__syncthreads()" provides block-level synchronization only; grid-level synchronization cannot be achieved inside a kernel.
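The stride-halving pattern can be emulated on the CPU to see how slot 0 of each block accumulates the block's total. A Python sketch (illustrative only, assuming a power-of-two block size as with BLK*BLK = 1024):

```python
def block_reduce_sum(values, block_size=1024):
    """Emulate the per-block tree reduction used by the mean kernel.

    block_size must be a power of two. Each block sums block_size
    "shared memory" slots by repeatedly halving the stride; slot 0
    ends up holding the block's total.
    """
    total = 0.0
    for start in range(0, len(values), block_size):
        # copy to "shared memory", zero-padding the last block
        temp = values[start:start + block_size]
        temp = temp + [0.0] * (block_size - len(temp))
        stride = block_size // 2
        while stride > 0:
            for tid in range(stride):
                temp[tid] += temp[tid + stride]
            stride //= 2
        # equivalent of atomicAdd(&d[0], temp[0])
        total += temp[0]
    return total

vals = [float(i) for i in range(10)]
print(block_reduce_sum(vals, block_size=8))  # 45.0
```

On the GPU the inner `for tid in range(stride)` loop runs as parallel threads, which is why the `__syncthreads()` barrier between stride levels is essential.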

**demo.py**: This is a simple Python script used to demonstrate the usage of CudaGSSL.

```python
from gssl import CudaGSSL
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

np.random.seed(2)
iris = load_iris()
X = np.array(iris.data)
y = np.array(iris.target)
# mask 19 out of every 20 labels as unlabelled (-1)
for ii in range(0, len(y), 20):
    y[ii:ii + 19] = -1

scaler = StandardScaler()
# transform data
X = scaler.fit_transform(X)
yy = np.array(y)
uniq = set(yy) - {-1}
# make one-hot encoding (unlabelled rows become all zeros)
one_hot = np.eye(len(uniq) + 1, dtype=np.float64)[yy][:, :-1]
one_hot = np.array(one_hot)
# init
g = CudaGSSL(X)
# generate W
g.gen_w()
# return W matrix
g.get_w()
# label propagation
p = g.label_prop(one_hot)
# print the inverse matrix
print(g.get_wi())
res = np.argmax(p, axis=1)
print("Total:", len(y))
print("total_labelled:", np.sum(y != -1))
print("Total Unlabelled:", np.sum(y == -1))
print("Accuracy :", np.mean(iris.target == res) * 100, "%")
```

**Output:**

```
Final Inverse Matrix
[[1.14443202e+00 1.12413906e-01 1.41056630e-01 ... 1.81851814e-04
  9.16517096e-05 2.47318131e-04]
 [1.12413906e-01 1.11524824e+00 1.55394531e-01 ... 2.78776491e-04
  1.38099194e-04 3.77555771e-04]
 [1.41056630e-01 1.55394531e-01 1.13863452e+00 ... 2.13171684e-04
  1.06453308e-04 2.89643574e-04]
 ...
 [1.81851814e-04 2.78776491e-04 2.13171684e-04 ... 1.09289149e+00
  7.18274220e-02 9.29884187e-02]
 [9.16517096e-05 1.38099194e-04 1.06453308e-04 ... 7.18274220e-02
  1.05593038e+00 5.33087866e-02]
 [2.47318131e-04 3.77555771e-04 2.89643574e-04 ... 9.29884187e-02
  5.33087866e-02 1.07869073e+00]]
Total Data Instance: 150
Total Labelled: 7
Total Unlabelled: 143
-1 denotes unlabelled data
Before Labelling: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  2
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  2 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1]
After Labelling: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 2 1 2 2
 1 2 1 1 2 1 2 2 2 2 1 2 1 2 2 1 1 1 2 2 2 1 1 1 2 2 1 1 2 2 2 1 2 2 2 1 2
 2 1]
Actual Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Accuracy : 85.33333333333334 %
```

**Reference**

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#