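Questions tagged cuda

CUDA is a parallel computing platform and programming model for Nvidia GPUs (graphics processing units). CUDA provides an interface to Nvidia GPUs through a variety of programming languages, libraries, and APIs.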

<code>
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that never returns; used to observe how later runtime calls behave.
__global__ void HelloFromGPU(int nth)
{
    nth = 4;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    while (1) { }
}

int main()
{
    int count;
    cudaGetDeviceCount(&count);
    printf("CPU\n");

    // Launch a 2x2 grid of 2x2 blocks; the launch itself is asynchronous.
    dim3 grid(2, 2);
    dim3 block(2, 2);
    HelloFromGPU<<<grid, block>>>(32);
    printf("after kernel\n");

    printf("start malloc\n");
    int *dev;
    int ret = cudaMalloc((void**)&dev, sizeof(int));
    printf("after malloc %d\n", ret);

    printf("start memcpy\n");
    ret = cudaMemset(dev, 0, sizeof(int));
    printf("after memset %d\n", ret);
    ret = cudaMemcpy(dev, &count, sizeof(int), cudaMemcpyHostToDevice);
    printf("after memcpy %d\n", ret);

    cudaFree(dev);
    printf("end\n");
    return 0;
}
</code>

... blocks implicitly and has to wait for all kernel or memory operations to complete. However, in my test the output is:

cuda

1 answer, 0 votes

windows10

cuda

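1 answer, 0 votes

CUDA: curand_uniform() distribution is not as random as expected

I have each kernel thread generate one random number. I am testing the randomness my program produces by treating each generated number as an index into a matrix array, setting that index to 1, and then, at the end of the program, summing up all the indices that were set to 1. Suppose I launch as many kernel threads as there are indices in the matrix, meaning that a 17K x 17K matrix launches 17K^2 threads and generates random numbers limited to [0, 17K^2). When I use ...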

random cuda distribution nvcc uniform-distribution

1 answer, 0 votes

How do I make NVCC raise an error on implicit integer narrowing/truncation?

cuda nvcc

1 answer, 0 votes

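gfortran error: Expected a right parenthesis

... one of the Fortran 77 example programs given in the NVIDIA cuBLAS documentation. This small example program calls cuBLAS functions from a Fortran application using the Fortran bindings provided by NVIDIA. The code uses C-style macros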

compiler-errors cuda fortran gfortran

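0 answers, 0 votes

I hope this message finds you well. I am trying to execute asynchronous tasks using CUDA streams. As I understand it, cudaMemcpyAsync should run asynchronously. However, for some reason, the tasks in one stream have to finish before the tasks in the next stream start. Here is my code:

In my example code, each stream uses two cudaMemcpyAsync calls. However, if I remove one of them, the streams execute asynchronously. I have checked for any shared parts related to pinned allocation. Furthermore, if I modify the code as shown below, the host-to-device cudaMemcpyAsync operations and the kernel executions once again run asynchronously across the streams. memcpy(hostsourceimage[0], hostsourceimage, fullwidth * fullheight * 1);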

cuda cuda-streams

0 answers, 0 votes

Trying to end stream capture fails due to "unjoined work"; but synchronizing fails while the capture is in progress

cuda runtime-error cuda-graphs

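1 answer, 0 votes

The code below always prints "Hello the Start" before anything else and "Hello from the Ender" after everything else. Why is that? Code: #include

The code below always prints "Hello the Start" before anything else and "Hello from the Ender" after everything else. Why?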

c++ cuda printf

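1 answer, 0 votes

Why does my Llama 3.1 model behave differently between AutoModelForCausalLM and LlamaForCausalLM? I have one set of weights, one tokenizer, the same prompt, and the same generation parameters. Yet somehow, when I load the model with AutoModelForCausalLM I get one output, and when I construct...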

<code>
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    LlamaForCausalLM,
    LlamaConfig
)

# 1) Adjust these as needed
model_name = "meta-llama/Llama-3.1-8B"
prompt = "Hello from Llama 3.1! Tell me something interesting."
dtype = torch.float16  # or torch.float32 if needed

# 2) Get the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Prepare input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

############################################
# A) Load with AutoModelForCausalLM
############################################
print("=== Loading with AutoModelForCausalLM ===")
model_auto = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",  # matches your usage
    torch_dtype=dtype
).cuda()
model_auto.eval()  # turn off dropout
config = model_auto.config

with torch.no_grad():
    out_auto = model_auto(**inputs)
logits_auto = out_auto.logits  # shape: [batch_size, seq_len, vocab_size]

del model_auto
torch.cuda.empty_cache()

############################################
# B) Load with LlamaForCausalLM + config
############################################
print("=== Loading with LlamaForCausalLM + config ===")
# Get config from the same checkpoint
# Build Llama model directly
model_llama = LlamaForCausalLM(config).cuda()
model_llama.eval()

# Load the same weights that AutoModelForCausalLM used
model_auto_temp = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype)
model_llama.load_state_dict(model_auto_temp.state_dict())
del model_auto_temp
torch.cuda.empty_cache()

with torch.no_grad():
    out_llama = model_llama(**inputs)
logits_llama = out_llama.logits

############################################
# C) Compare the Logits
############################################
# Compute maximum absolute difference
max_diff = (logits_auto - logits_llama).abs().max()
print(f"\nMax absolute difference between logits: {max_diff.item()}")
if max_diff < 1e-7:
    print("→ The logits are effectively identical (within floating-point precision).")
else:
    print("→ There is a non-trivial difference in logits!")
</code>

python pytorch cuda huggingface llama

1 answer, 0 votes

CUDA C++ error: a __device__ function call cannot be configured

c++ parallel-processing cuda gpu gpgpu

1 answer, 0 votes

I have something very similar to this code:

< no_streams; k++) cudaStreamCreate(&stream[k]); cudaMalloc(&g_in, size1*

cuda cuda-streams

1 answer, 0 votes

The CUDA "driver version" looks like a CUDA runtime version - so what is the difference?

cuda version nvidia

1 answer, 0 votes