tobyjiang Linux Dev Engineer

MMIO Emulation

2019-01-07
tobyjiang
 

EPT exits caused by MMIO addresses

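MMIO is marked by setting reserved bits in the spte. The first time the guest accesses the gpa of an MMIO region, an EPT_VIOLATION occurs. KVM then checks the gpa and finds that it has no corresponding pfn (QEMU never registered that range), so it treats the address as MMIO and calls set_mmio_spte to mark its spte as an MMIO spte. Any later access to the same gpa raises EPT_MISCONFIG instead, and KVM handles the MMIO operation through handle_ept_misconfig -> handle_mmio_page_fault -> x86_emulate_instruction.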

handle_ept_violation->kvm_mmu_page_fault->handle_mmio_page_fault

int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
{
	u64 spte;
	bool reserved;
	//The gpa that triggered the EPT_VIOLATION hits the MMIO cache on the vcpu,
	//i.e. vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT,
	//so return RET_MMIO_PF_EMULATE directly and emulate the instruction
	if (mmio_info_in_cache(vcpu, addr, direct))
		return RET_MMIO_PF_EMULATE;
	//Walk the shadow page table and fetch the spte for addr; reserved reports unexpected reserved bits
	reserved = walk_shadow_page_get_mmio_spte(vcpu, addr, &spte);
	if (WARN_ON(reserved))
		return RET_MMIO_PF_BUG;
	//Check whether this spte is an MMIO spte,
	//i.e. (spte & shadow_mmio_mask) == shadow_mmio_mask;
	//shadow_mmio_mask is initialized in ept_set_mmio_spte_mask
	if (is_mmio_spte(spte)) {
		//Extract the guest frame number (gfn) from the spte
		gfn_t gfn = get_mmio_spte_gfn(spte);
		unsigned access = get_mmio_spte_access(spte);

		if (!check_mmio_spte(vcpu, spte))
			return RET_MMIO_PF_INVALID;

		if (direct)
			addr = 0;	

		trace_handle_mmio_page_fault(addr, gfn, access);
		//Cache the gfn of the most recently hit MMIO gpa in the vcpu's mmio_gfn, so the next mmio_info_in_cache lookup hits and can return directly
		vcpu_cache_mmio_info(vcpu, addr, gfn, access);
		return RET_MMIO_PF_EMULATE;
	}

	/*
	 * If the page table is zapped by other cpus, let CPU fault again on
	 * the address.
	 */
	return RET_MMIO_PF_RETRY;
}
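
The per-vCPU MMIO cache checked at the top of handle_mmio_page_fault can be pictured with a small user-space model. This is only a sketch: struct vcpu_mmio_cache and both helper names are hypothetical stand-ins for the kernel's vcpu->arch fields, mmio_info_in_cache and vcpu_cache_mmio_info, and it covers only the direct (gpa-based) case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12

/* Hypothetical stand-in for the MMIO cache fields kept per vCPU. */
struct vcpu_mmio_cache {
    bool     valid;      /* false until the first MMIO frame is recorded */
    uint64_t mmio_gfn;   /* guest frame number of the last MMIO access */
    unsigned access;     /* access bits recorded together with the gfn */
};

/* Remember the most recent MMIO frame (models vcpu_cache_mmio_info). */
static inline void model_cache_mmio_info(struct vcpu_mmio_cache *c,
                                         uint64_t gfn, unsigned access)
{
    assert(c != NULL);
    c->mmio_gfn = gfn;
    c->access = access;
    c->valid = true;
}

/* True if gpa lies in the cached MMIO frame (models mmio_info_in_cache). */
static inline bool model_mmio_info_in_cache(const struct vcpu_mmio_cache *c,
                                            uint64_t gpa)
{
    assert(c != NULL);
    return c->valid && c->mmio_gfn == (gpa >> MODEL_PAGE_SHIFT);
}
```

Once a frame is recorded, any gpa inside the same 4 KiB page hits the cache, so the page-table walk is skipped and emulation starts right away.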

The comment in the function below captures the main idea:

static void ept_set_mmio_spte_mask(void)
{
	/*
	 * EPT Misconfigurations can be generated if the value of bits 2:0
	 * of an EPT paging-structure entry is 110b (write/execute).
	 * Also, magic bits (0x3ull << 62) is set to quickly identify mmio
	 * spte.
	 */
	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
}

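When an EPT exit is caused by an MMIO address, KVM first checks the gpa (is_mmio_page_fault) and then emulates the instruction.

On the x86 architecture, devices can be accessed in two ways: PIO and MMIO (Memory Mapped I/O). So how exactly does QEMU-KVM implement MMIO access to devices?

MMIO maps device I/O directly into the physical address space, and guest physical memory is virtualized through the EPT mechanism, so MMIO for emulated devices also has to build on EPT. The guest's EPT page tables are populated while handling EPT_VIOLATION exits. For an emulated device, an MMIO access must trigger a VM exit and be handed to QEMU/KVM for processing, so how is an MMIO access told apart from other faults? According to the Intel SDM, this is done with EPT_MISCONFIG. What, then, is the difference between EPT_VIOLATION and EPT_MISCONFIG?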

EXIT_REASON_EPT_VIOLATION is similar to a "page not present" pagefault.
EXIT_REASON_EPT_MISCONFIG is similar to a "reserved bit set" pagefault.

EPT_VIOLATION means the corresponding physical page is not present, whereas EPT_MISCONFIG means an EPT paging-structure entry contains an illegal field.
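
To make the distinction concrete, here is a minimal sketch that classifies an access using only bits 2:0 (read, write, execute) of an EPT entry. The function and enum names are hypothetical. It assumes execute-only entries are supported and ignores the other misconfiguration causes (reserved bits, memory type), so it is a simplification of the hardware rules rather than a full model.

```c
#include <assert.h>
#include <stdint.h>

#define EPT_R 0x1u   /* bit 0: read */
#define EPT_W 0x2u   /* bit 1: write */
#define EPT_X 0x4u   /* bit 2: execute */

enum ept_result { EPT_ACCESS_OK, EPT_RESULT_VIOLATION, EPT_RESULT_MISCONFIG };

/* Classify one access (a mask of EPT_R/EPT_W/EPT_X) against an EPT entry. */
static inline enum ept_result classify_ept_access(uint64_t entry, unsigned access)
{
    unsigned perm = (unsigned)(entry & 0x7u);

    assert(access != 0 && (access & ~0x7u) == 0);

    /* 010b (write-only) and 110b (write/execute) are illegal encodings. */
    if (perm == EPT_W || perm == (EPT_W | EPT_X))
        return EPT_RESULT_MISCONFIG;
    /* 000b: entry not present, similar to a "page not present" fault. */
    if (perm == 0)
        return EPT_RESULT_VIOLATION;
    /* Present, but the requested access is not permitted. */
    if (access & ~perm)
        return EPT_RESULT_VIOLATION;
    return EPT_ACCESS_OK;
}
```

An entry whose low bits are 110b is reported as a misconfiguration no matter which access type is attempted, which is why that encoding works as an MMIO marker.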

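There are two questions to clear up here.

1 How does KVM mark an EPT entry as MMIO?

During hardware_setup, if EPT is enabled, KVM calls ept_set_mmio_spte_mask to initialize shadow_mmio_mask, and from then on uses this mask to decide whether an spte is the page-table entry of an MMIO address. The lowest 3 bits of the marker are 110b. Once (spte & shadow_mmio_mask) == shadow_mmio_mask holds (in the version shown below, which passes a separate mask and value, the comparison is against shadow_mmio_value), accessing that entry triggers EPT_MISCONFIG: 110b marks the page as writable and executable but not readable, which is an illegal combination for an EPT entry.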

static void ept_set_mmio_spte_mask(void)
{
    /*
     * EPT Misconfigurations can be generated if the value of bits 2:0
     * of an EPT paging-structure entry is 110b (write/execute).
     */ 
    kvm_mmu_set_mmio_spte_mask(VMX_EPT_RWX_MASK,
                   VMX_EPT_MISCONFIG_WX_VALUE);
}
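
The mask/value pair can be tried out in user space with the sketch below. The constant values are assumptions on my part: 0x7 and 0x6 for the RWX mask and the write/execute value, and bit 62 for the special marker bit. struct mmio_marker and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_EPT_RWX_MASK  0x7ull         /* bits 2:0: read, write, execute */
#define MODEL_EPT_WX_VALUE  0x6ull         /* 110b: write + execute, no read */
#define MODEL_SPECIAL_MASK  (1ull << 62)   /* assumed software marker bit */

struct mmio_marker {
    uint64_t mask;    /* bits inspected when testing an spte */
    uint64_t value;   /* pattern those bits must equal for an MMIO spte */
};

/* Build the marker the way kvm_mmu_set_mmio_spte_mask combines its inputs. */
static inline struct mmio_marker make_mmio_marker(uint64_t mask, uint64_t value)
{
    struct mmio_marker m;

    assert((mask & value) == value);   /* mirrors the BUG_ON check */
    m.mask = mask | MODEL_SPECIAL_MASK;
    m.value = value | MODEL_SPECIAL_MASK;
    return m;
}

/* An spte is MMIO when the masked bits match the marker value exactly. */
static inline bool spte_is_mmio(struct mmio_marker m, uint64_t spte)
{
    return (spte & m.mask) == m.value;
}
```

Comparing against a separate value rather than the mask itself matters here: an spte with all of RWX set plus the special bit covers every mask bit, yet it is not the 110b pattern and must not be classified as MMIO.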

In addition, some special bits of the EPT entry are set to indicate that the spte describes MMIO rather than guest physical memory, for example:

(1)set the special mask:  SPTE_SPECIAL_MASK.
(2)reserved physical address bits:  the setting of a bit in the range 51:12 that is beyond the logical processor’s physical-address width

The SDM describes EPT_MISCONFIG in detail.

EPT_MISCONFIG

void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
{
    BUG_ON((mmio_mask & mmio_value) != mmio_value);
    shadow_mmio_value = mmio_value | SPTE_SPECIAL_MASK;
    shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
}       
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);

static void kvm_set_mmio_spte_mask(void)
{
    u64 mask;
    int maxphyaddr = boot_cpu_data.x86_phys_bits;

    /* 
     * Set the reserved bits and the present bit of an paging-structure
     * entry to generate page fault with PFER.RSV = 1.
     */
     /* Mask the reserved physical address bits. */
    mask = rsvd_bits(maxphyaddr, 51);

    /* Set the present bit. */
    mask |= 1ull;

#ifdef CONFIG_X86_64
    /*
     * If reserved bit is not supported, clear the present bit to disable
     * mmio page fault.
     */
    if (maxphyaddr == 52)
        mask &= ~1ull;
#endif

    kvm_mmu_set_mmio_spte_mask(mask, mask);
}       
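
As a worked example of the mask computed above, the sketch below reimplements the arithmetic in user space. model_rsvd_bits follows the usual "bits s through e set" definition, which I am assuming matches the kernel helper. model_legacy_mmio_mask is a hypothetical name that takes maxphyaddr as a parameter instead of reading boot_cpu_data, and drops the CONFIG_X86_64 guard.

```c
#include <assert.h>
#include <stdint.h>

/* Return a mask with bits s..e (inclusive) set; s == e + 1 yields 0. */
static inline uint64_t model_rsvd_bits(int s, int e)
{
    assert(s >= 0 && e <= 63 && s <= e + 1 && (e - s + 1) < 64);
    return ((1ull << (e - s + 1)) - 1) << s;
}

/* Reserved physical-address bits plus the present bit, as in the listing. */
static inline uint64_t model_legacy_mmio_mask(int maxphyaddr)
{
    uint64_t mask = model_rsvd_bits(maxphyaddr, 51);

    mask |= 1ull;                 /* set the present bit */
    if (maxphyaddr == 52)
        mask &= ~1ull;            /* no reserved bits left: disable the trick */
    return mask;
}
```

With a 46-bit physical address width the reserved range is bits 51:46, so the mask is 0x000FC00000000001. With 52 bits no reserved address bits remain and the mask collapses to 0.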

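After KVM has created the EPT entry with these flag bits set, the next access to the page triggers an EPT_MISCONFIG exit, and handle_ept_misconfig -> handle_mmio_page_fault is called to complete the MMIO handling.

2 How does QEMU mark a device's MMIO?

Take the emulated e1000 NIC as an example: when the device initializes its MMIO, the MemoryRegion it registers is of IO type (not RAM type).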

static void
e1000_mmio_setup(E1000State *d)
{
    int i;
    const uint32_t excluded_regs[] = {
        E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
        E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
    };
    // MMIO is registered here via memory_region_init_io, so mr->ram = false
    memory_region_init_io(&d->mmio, OBJECT(d), &e1000_mmio_ops, d,
                          "e1000-mmio", PNPMMIO_SIZE);
    memory_region_add_coalescing(&d->mmio, 0, excluded_regs[0]);
    for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
        memory_region_add_coalescing(&d->mmio, excluded_regs[i] + 4,
                                     excluded_regs[i+1] - excluded_regs[i] - 4);
    memory_region_init_io(&d->io, OBJECT(d), &e1000_io_ops, d, "e1000-io", IOPORT_SIZE);
}

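From QEMU-KVM memory management we know that when QEMU calls kvm_set_phys_mem to register guest physical memory in KVM's data structures, it calls memory_region_is_ram to check whether the address range is backed by RAM; if it is not, the function simply returns.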

static void kvm_set_phys_mem(KVMMemoryListener *kml,
                             MemoryRegionSection *section, bool add)
{
    ......
    if (!memory_region_is_ram(mr)) {
        if (writeable || !kvm_readonly_mem_allowed) {
            return;     // the device MR is not RAM but is writable, so return here without registering it with KVM
        } else if (!mr->romd_mode) {
            /* If the memory device is not in romd_mode, then we actually want
             * to remove the kvm memory slot so all accesses will trap. */
            add = false;
        }
    }
    ......
}
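
The branch above can be condensed into a small decision function. This is a hypothetical model, not QEMU code: struct region_model carries only the three properties the excerpt tests, writeable is taken as an input because the code that computes it is elided in the listing, and everything after the branch is reduced to "add or remove a slot".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum slot_action { SLOT_SKIP, SLOT_ADD, SLOT_REMOVE };

/* Hypothetical summary of the MemoryRegion properties used by the check. */
struct region_model {
    bool is_ram;      /* result of memory_region_is_ram(mr) */
    bool writeable;   /* computed in the elided part of the function */
    bool romd_mode;   /* mr->romd_mode */
};

/* Decide what happens to a region when the listener sees it. */
static inline enum slot_action kvm_slot_action(const struct region_model *mr,
                                               bool readonly_mem_allowed,
                                               bool add)
{
    assert(mr != NULL);
    if (!mr->is_ram) {
        if (mr->writeable || !readonly_mem_allowed)
            return SLOT_SKIP;     /* plain MMIO: never registered with KVM */
        else if (!mr->romd_mode)
            add = false;          /* drop the slot so every access traps */
    }
    return add ? SLOT_ADD : SLOT_REMOVE;
}
```

For a region like the e1000 MMIO BAR (not RAM, writable) the model returns SLOT_SKIP, which is why KVM later finds no memslot for those guest frames.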

#0  0x000055555572485e in kvm_set_phys_mem (kml=0x55555661e1c0, section=0x7fffe5d923c0, add=true) at /home/tobyjiang/Documents/qemu-2.6.0/kvm-all.c:63
#1  0x0000555555724eb8 in kvm_region_add (listener=0x55555661e1c0, section=0x7fffe5d923c0) at /home/tobyjiang/Documents/qemu-2.6.0/kvm-all.c:798
#2  0x000055555572c666 in address_space_update_topology_pass (as=0x5555560d9ce0 , old_view=0x7fffe01aa0c0, new_view=0x7fffe0003f30, adding=true) at /home/tobyjiang/Documents/qemu-2.6.0/memory.c:870
#3  0x000055555572c750 in address_space_update_topology (as=0x5555560d9ce0 ) at /home/tobyjiang/Documents/qemu-2.6.0/memory.c:885
#4  0x000055555572c890 in memory_region_transaction_commit () at /home/tobyjiang/Documents/qemu-2.6.0/memory.c:925
#5  0x000055555572f70a in memory_region_update_container_subregions (subregion=0x7fffe549f8b0) at /home/tobyjiang/Documents/qemu-2.6.0/memory.c:1902
#6  0x000055555572f779 in memory_region_add_subregion_common (mr=0x55555669ed50, offset=4273733632, subregion=0x7fffe549f8b0) at /home/tobyjiang/Documents/qemu-2.6.0/memory.c:1912
#7  0x000055555572f807 in memory_region_add_subregion_overlap (mr=0x55555669ed50, offset=4273733632, subregion=0x7fffe549f8b0, priority=1) at /home/tobyjiang/Documents/qemu-2.6.0/memory.c:1931
#8  0x0000555555964b51 in pci_update_mappings (d=0x7fffe549d010) at /home/tobyjiang/Documents/qemu-2.6.0/hw/pci/pci.c:1185
#9  0x0000555555964e38 in pci_default_write_config (d=0x7fffe549d010, addr=4, val_in=259, l=2) at /home/tobyjiang/Documents/qemu-2.6.0/hw/pci/pci.c:1237
#10 0x000055555593ed43 in e1000_write_config (pci_dev=0x7fffe549d010, address=4, val=259, len=2) at /home/tobyjiang/Documents/qemu-2.6.0/hw/net/e1000.c:180

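For MMIO memory, QEMU does not call kvm_set_user_memory_region to register it, so KVM treats the pfn of that range as KVM_PFN_NOSLOT and calls set_mmio_spte to set up the spte for the address. That function checks whether the pfn carries the NOSLOT marker to confirm that the address range is MMIO.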

static bool set_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
              kvm_pfn_t pfn, unsigned access)
{
    if (unlikely(is_noslot_pfn(pfn))) {
        mark_mmio_spte(vcpu, sptep, gfn, access);// set up the spte
        return true;
    }

    return false;
}
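
What marking an spte roughly amounts to can be sketched as follows. This is a simplified model under stated assumptions: the marker constants (bit 62 plus 110b in the low bits) and the NOSLOT sentinel value are assumed, all names are hypothetical, and the access bits and MMIO generation number that the real mark_mmio_spte also stores are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT  12
#define MODEL_MMIO_MASK   ((1ull << 62) | 0x7ull)   /* bits tested for the marker */
#define MODEL_MMIO_VALUE  ((1ull << 62) | 0x6ull)   /* marker pattern (110b) */
#define MODEL_PFN_NOSLOT  (1ull << 63)              /* assumed "no memslot" sentinel */

/* If the pfn has no backing memslot, write an MMIO spte that records the gfn. */
static inline bool model_set_mmio_spte(uint64_t *sptep, uint64_t gfn, uint64_t pfn)
{
    assert(sptep != NULL);
    if (pfn == MODEL_PFN_NOSLOT) {
        *sptep = MODEL_MMIO_VALUE | (gfn << MODEL_PAGE_SHIFT);
        return true;
    }
    return false;   /* ordinary memory: the caller installs a normal spte */
}

/* Recover the gfn stored in an MMIO spte by stripping the marker bits. */
static inline uint64_t model_mmio_spte_gfn(uint64_t spte)
{
    return (spte & ~MODEL_MMIO_MASK) >> MODEL_PAGE_SHIFT;
}
```

Storing the gfn inside the spte is what lets the fault handler recover the frame number later without consulting the memslots again.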

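3 Summary

MMIO is marked by setting reserved bits in the spte. The first time the guest accesses the gpa of an MMIO region, an EPT_VIOLATION occurs. KVM then checks the gpa and finds that it has no corresponding pfn (QEMU never registered that range), so it treats the address as MMIO and calls set_mmio_spte to mark its spte as an MMIO spte. Any later access to the same gpa raises EPT_MISCONFIG instead, and KVM handles the MMIO operation through handle_ept_misconfig -> handle_mmio_page_fault -> x86_emulate_instruction.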
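
The whole sequence can be replayed with a toy model that tracks only which exit reason each access produces. Everything here is hypothetical (struct toy_vm, toy_access, the four-frame address space, the marker constants). It is meant as an illustration of the flow summarized above, not of KVM's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FRAMES     4
#define TOY_MMIO_MASK  ((1ull << 62) | 0x7ull)
#define TOY_MMIO_VALUE ((1ull << 62) | 0x6ull)   /* assumed MMIO marker */

enum exit_kind { NO_EXIT, EXIT_EPT_VIOLATION, EXIT_EPT_MISCONFIG };

struct toy_vm {
    uint64_t spte[TOY_FRAMES];       /* one entry per guest frame, 0 = not present */
    bool     has_slot[TOY_FRAMES];   /* true if userspace registered the frame as RAM */
};

/* Simulate one guest access to frame gfn and report the resulting exit. */
static inline enum exit_kind toy_access(struct toy_vm *vm, unsigned gfn)
{
    uint64_t spte;

    assert(gfn < TOY_FRAMES);
    spte = vm->spte[gfn];

    if ((spte & TOY_MMIO_MASK) == TOY_MMIO_VALUE)
        return EXIT_EPT_MISCONFIG;           /* marked earlier: emulate the access */

    if (spte == 0) {
        /* First touch: install either a normal RWX entry or the MMIO marker. */
        vm->spte[gfn] = vm->has_slot[gfn]
            ? (((uint64_t)gfn << 12) | 0x7ull)
            : (TOY_MMIO_VALUE | ((uint64_t)gfn << 12));
        return EXIT_EPT_VIOLATION;
    }
    return NO_EXIT;                          /* mapped RAM: no exit at all */
}
```

For a frame without a slot, the first access reports EXIT_EPT_VIOLATION and every later one reports EXIT_EPT_MISCONFIG. A RAM frame faults once and then runs without exits.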

