Can C++ code be used in a CUDA C program? When I include the iostream header, the compiler gets confused by the kernel call(s).

CUDA has no STL support in device code. std::vector simply does not work inside a CUDA kernel. The closest thing is probably the Thrust library. For example,

thrust::device_vector<float> vect;

but it doesn't work in device kernel code. It's just an STL-like structure with GPU-backed storage, so you can run computations on it on the GPU using simple code like this (the device lambda requires nvcc's --extended-lambda flag):

thrust::transform(
    vect.begin(),
    vect.end(),
    vect.begin(),
    [=] __device__ (float x) { return x + 1; }
);
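A more complete version of the same idea might look like the sketch below. It swaps the lambda for a functor so that no extra compiler flag is needed; the names `add_one` and `add_one_on_gpu` are made up for illustration and are not part of Thrust.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <vector>

// Hypothetical functor: plays the role of the device lambda.
struct add_one
{
    __host__ __device__ float operator()(float x) const { return x + 1.0f; }
};

// Hypothetical helper: copies the input to the GPU, adds 1 to every
// element there, and copies the result back to the host.
std::vector<float> add_one_on_gpu(const std::vector<float>& input)
{
    thrust::device_vector<float> vect(input.begin(), input.end()); // host -> device
    thrust::transform(vect.begin(), vect.end(), vect.begin(), add_one());
    std::vector<float> output(vect.size());
    thrust::copy(vect.begin(), vect.end(), output.begin());        // device -> host
    return output;
}
```

Calling `add_one_on_gpu({1.0f, 2.0f, 3.0f})` from host code should return the three values incremented by one. Note that all Thrust calls here are made from the host; none of this runs inside a hand-written kernel.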

So it's still a CUDA C/C++ program, just not explicit device kernel code. If you meant to use containers inside device kernel code (not host code), then you have to implement one yourself, because there are many use cases and the performance of each varies widely with the hardware and the algorithm.

For example, a parallel vector in a kernel could look like this:

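// Minimal "heap": a caller-provided buffer plus a counter for the next free slot.
class fake_heap
{
public:
    __device__ fake_heap(float * heap, int * counter)
    {
        h = heap;
        ctr = counter;
    }
    int * ctr;  // must live in memory visible to every thread that pushes
    float * h;  // backing storage (shared memory in this test)
};
class vector
{
public:
    __device__ vector(fake_heap * heapInput)
    {
        heap = heapInput;
        index = 0;
    }
    __device__ float& operator[](int idx)
    {
        return heap->h[idx];
    }
    // Must be called by every thread of the block because of __syncthreads().
    __device__ void push_back(float data)
    {
        index = atomicAdd(heap->ctr, 1); // reserve a unique slot
        heap->h[index] = data;
        __syncthreads();                 // make all pushes visible before any read
    }
    int index;
    fake_heap * heap;
};
// Assumes a single block of at most 256 threads.
extern "C"
__global__ void parallel_vector_test(int * __restrict__ data)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ float memory[256];
    __shared__ int counter;
    if (threadIdx.x == 0) counter = 0;   // one thread resets the shared counter
    __syncthreads();
    fake_heap heap(memory, &counter);
    vector vect(&heap);                  // a vector that spans N threads
    vect.push_back(i);                   // all threads push into the same vector
    data[i] = vect[threadIdx.x];
}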

The output in the data array (launched with 32 threads, all pushing into the same 32-element vector) was:

0,8,1,9,2,10,3,11,4,12,5,13,6,14,
7,15,16,24,17,25,18,26,19,27,20,
28,21,29,22,30,23,31

and not everyone wants it this way, since the order in which the atomic operations complete is not guaranteed. The requirements can be anything. Someone may want the values ordered the same way as the thread ids, someone may need the vector not to be limited to a few kilobytes and to live in global memory, and, quite likely, someone may want per-thread vector structures instead, to have multiple independent vectors.
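As one illustration of these variations, here is a rough sketch of a vector that lives in global memory and is shared by the whole grid rather than by one block. The names `global_vector`, `push_thread_ids`, and `collect_thread_ids` are hypothetical, the capacity is fixed at allocation time, and error checking is left out for brevity.

```cuda
#include <algorithm>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical container: fixed-capacity vector in global memory.
struct global_vector
{
    float* data;     // storage in global memory
    int*   size;     // number of push attempts so far (global memory)
    int    capacity;

    // Returns the slot written, or -1 if the vector is already full.
    __device__ int push_back(float value)
    {
        int slot = atomicAdd(size, 1);   // reserve a unique slot
        if (slot >= capacity) return -1;
        data[slot] = value;
        return slot;
    }
};

__global__ void push_thread_ids(global_vector vect)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    vect.push_back(static_cast<float>(i));
}

// Hypothetical host helper: every thread pushes its global id once;
// the contents are returned sorted because push order is unspecified.
std::vector<float> collect_thread_ids(int blocks, int threads)
{
    const int n = blocks * threads;
    global_vector vect;
    vect.capacity = n;
    cudaMalloc(&vect.data, n * sizeof(float));
    cudaMalloc(&vect.size, sizeof(int));
    cudaMemset(vect.size, 0, sizeof(int));

    push_thread_ids<<<blocks, threads>>>(vect);

    std::vector<float> host(n);
    cudaMemcpy(host.data(), vect.data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(vect.data);
    cudaFree(vect.size);
    std::sort(host.begin(), host.end());
    return host;
}
```

Because the counter is a single global integer, every push contends on the same address; that is acceptable for a sketch but is exactly the kind of trade-off that makes a one-size-fits-all device container hard to provide.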

But, as you can see from the class definitions, you can use C++ (albeit somewhat trimmed in features) in kernel code. Look at the overloaded index operator. It is certainly not C99. From these C++-like capabilities you can build advanced containers that do the job the way you want. For example, you can do this in a CUDA device function:

float * mem = new float[256];
delete[] mem;

but this allocates from the device heap, which resides in global memory (possibly cached by L1), and it is a per-thread allocation/deallocation. It tends to be too slow, since neighboring threads that access their own mem buffer at the same index are at least 256 elements apart, so the accesses are not coalesced. You may need a parallel allocator/indexer implementation for this to work fast.
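One possible shape for such an indexer, sketched under the assumption that every thread needs the same fixed number of elements: allocate a single pool from the host and interleave the per-thread buffers so that element k of neighboring threads sits at neighboring addresses. The names `interleaved_buffer`, `fill_and_sum`, and `run_fill_and_sum` are invented for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical indexer: element k of thread t lives at k * totalThreads + t,
// so threads of a warp that use the same k touch consecutive addresses.
struct interleaved_buffer
{
    float* base;
    int totalThreads;
    int thread;
    __device__ float& operator[](int k) { return base[k * totalThreads + thread]; }
};

__global__ void fill_and_sum(float* pool, float* sums, int perThread)
{
    const int t = threadIdx.x + blockIdx.x * blockDim.x;
    interleaved_buffer mem{pool, static_cast<int>(gridDim.x * blockDim.x), t};
    for (int k = 0; k < perThread; k++) mem[k] = static_cast<float>(t + k);
    float sum = 0.0f;
    for (int k = 0; k < perThread; k++) sum += mem[k];
    sums[t] = sum;   // equals perThread * t + (0 + 1 + ... + perThread - 1)
}

// Hypothetical host helper: one allocation serves all threads.
std::vector<float> run_fill_and_sum(int blocks, int threads, int perThread)
{
    const int n = blocks * threads;
    float* pool;
    float* sums;
    cudaMalloc(&pool, sizeof(float) * n * perThread);
    cudaMalloc(&sums, sizeof(float) * n);
    fill_and_sum<<<blocks, threads>>>(pool, sums, perThread);
    std::vector<float> host(n);
    cudaMemcpy(host.data(), sums, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(pool);
    cudaFree(sums);
    return host;
}
```

This avoids in-kernel new/delete entirely, at the cost of fixing the per-thread size up front. Whether it is actually faster depends on the access pattern and the hardware, so it is worth profiling both versions.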

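As for the header problem: when you include a header, the compiler needs to know where each of its functions may be called from. CUDA expresses this with the qualifiers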

__device__
__global__
__host__

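These do not appear in the function definitions of headers that are unrelated to CUDA, so those functions are treated as host-only. You can try CUDA compiler options that relax this (for example, nvcc's --expt-relaxed-constexpr lets device code call constexpr functions that lack the qualifier) to make it treat functions in the header as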

__device__

so that they may be called from device kernel code.
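To make the host/device split concrete, the following sketch includes iostream in a .cu file and keeps its use on the host side, while a function marked with both qualifiers is shared between host and kernel. It assumes the file is compiled with nvcc; `square`, `square_kernel`, and `run_square_kernel` are illustrative names only.

```cuda
#include <cuda_runtime.h>
#include <iostream>   // host-only header; including it in a .cu file is fine

// Carries both qualifiers, so it can be called from host and device code.
__host__ __device__ inline int square(int x) { return x * x; }

__global__ void square_kernel(int* out)
{
    out[threadIdx.x] = square(static_cast<int>(threadIdx.x));
    // std::cout << ... would not compile here: the stream operators are
    // ordinary host functions without a __device__ qualifier.
}

// Hypothetical host helper: runs the kernel and returns the last element.
int run_square_kernel(int threads)
{
    int* d_out;
    cudaMalloc(&d_out, threads * sizeof(int));
    square_kernel<<<1, threads>>>(d_out);
    int last = 0;
    cudaMemcpy(&last, d_out + threads - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    // iostream works as usual on the host side.
    std::cout << "device: " << last << ", host: " << square(threads - 1) << std::endl;
    return last;
}
```

If a setup like this still fails to compile, the cause is more likely the file extension or the compiler being invoked (the `<<<...>>>` launch syntax is only understood by nvcc in CUDA source files) than the iostream include itself.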