fused_qkv implementation logic: llama vs ernie

fused_qkv (llama) implementation logic diagrams:

tp2 -> tp4, num_heads = k_v_nums:

tp2 -> tp4, num_heads > k_v_nums:

old_fused_qkv (ernie) implementation logic diagrams:

tp2 -> tp4, num_heads = k_v_nums: the logic here is the same as above, and the last dimension is again split evenly.

tp2 -> tp4, num_heads > k_v_nums:
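
The tp2 -> tp4 conversion can be sketched in NumPy as a rough illustration. Since the diagrams are not available here, both column layouts below are assumptions and not the confirmed llama or ernie layouts. The "grouped" layout stores each k/v head next to its query heads, which is consistent with the note that the last dimension can simply be split evenly. The "blocked" layout stores `[Q | K | V]` per rank and is shown as one case where an even split is not enough. All function names are hypothetical.

```python
import numpy as np

def fuse_grouped(q, k, v, num_heads, num_kv_heads, head_dim):
    """Assumed layout: per kv head, [its q heads | k | v] along the last dim."""
    hidden = q.shape[0]
    group = num_heads // num_kv_heads  # query heads per kv head
    qg = q.reshape(hidden, num_kv_heads, group * head_dim)
    kg = k.reshape(hidden, num_kv_heads, head_dim)
    vg = v.reshape(hidden, num_kv_heads, head_dim)
    return np.concatenate([qg, kg, vg], axis=-1).reshape(hidden, -1)

def fuse_blocked(q, k, v):
    """Assumed layout: [all q | all k | all v] along the last dim."""
    return np.concatenate([q, k, v], axis=-1)

def direct_shards(q, k, v, tp, num_heads, num_kv_heads, head_dim, grouped):
    """Build the per-rank fused weights directly for a given tp degree."""
    shards = []
    for qs, ks, vs in zip(np.split(q, tp, axis=-1),
                          np.split(k, tp, axis=-1),
                          np.split(v, tp, axis=-1)):
        if grouped:
            shards.append(fuse_grouped(qs, ks, vs, num_heads // tp,
                                       num_kv_heads // tp, head_dim))
        else:
            shards.append(fuse_blocked(qs, ks, vs))
    return shards

def reshard_grouped(shards, factor):
    """Grouped layout: split each shard's last dim evenly."""
    out = []
    for s in shards:
        out.extend(np.split(s, factor, axis=-1))
    return out

def reshard_blocked(shards, factor, q_cols, kv_cols):
    """Blocked layout: split q, k, v separately, then re-fuse per new rank.

    q_cols and kv_cols are the q and k/v widths inside ONE source shard.
    """
    out = []
    for s in shards:
        q, k, v = np.split(s, [q_cols, q_cols + kv_cols], axis=-1)
        parts = zip(np.split(q, factor, axis=-1),
                    np.split(k, factor, axis=-1),
                    np.split(v, factor, axis=-1))
        out.extend(np.concatenate(p, axis=-1) for p in parts)
    return out

# Toy sizes with num_heads > k_v_nums (8 query heads, 4 kv heads).
hidden, num_heads, num_kv_heads, head_dim = 4, 8, 4, 2
rng = np.random.default_rng(0)
q = rng.standard_normal((hidden, num_heads * head_dim))
k = rng.standard_normal((hidden, num_kv_heads * head_dim))
v = rng.standard_normal((hidden, num_kv_heads * head_dim))

tp2 = direct_shards(q, k, v, 2, num_heads, num_kv_heads, head_dim, grouped=True)
tp4 = direct_shards(q, k, v, 4, num_heads, num_kv_heads, head_dim, grouped=True)
grouped_ok = all(np.array_equal(a, b)
                 for a, b in zip(reshard_grouped(tp2, 2), tp4))
print("grouped layout, even split matches tp4:", grouped_ok)
```

Under these assumptions, the grouped layout needs only an even split of the last dimension, whereas the blocked layout has to go through `reshard_blocked`, because an even split would put all of a rank's query columns into one new shard and its k/v columns into the other.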