Opened 13 months ago
Closed 12 months ago
#563 closed enhancement (fixed)
Ensure Xen free_memory is always guaranteed to be above a fixed threshold
| Reported by: | joanna | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | Release 1 |
| Component: | core | Keywords: | |
| Cc: |
Description
Previously I had no problems starting this VM, but suddenly, today, it crashes:
[ 0.012366] installing Xen timer for CPU 1
[ 0.012394] ------------[ cut here ]------------
[ 0.012400] kernel BUG at /home/user/qubes-src/kernel/kernel-3.2.7/linux-3.2.7/arch/x86/xen/smp.c:322!
[ 0.012409] invalid opcode: 0000 [#1] SMP
[ 0.012416] CPU 0
[ 0.012418] Modules linked in:
[ 0.012424]
[ 0.012429] Pid: 1, comm: swapper/0 Not tainted 3.2.7-3.pvops.qubes.x86_64 #1
[ 0.012438] RIP: e030:[<ffffffff8143a229>] [<ffffffff8143a229>] cpu_initialize_context+0x263/0x280
[ 0.012452] RSP: e02b:ffff880031863e10 EFLAGS: 00010282
[ 0.012457] RAX: fffffffffffffff4 RBX: ffff8800318c0000 RCX: 0000000000000000
[ 0.012463] RDX: ffff8800318c0000 RSI: 0000000000000001 RDI: 0000000000000000
[ 0.012470] RBP: ffff880031863e50 R08: 00003ffffffff000 R09: ffff880000000000
[ 0.012476] R10: ffff8800318c0000 R11: 0000000000002000 R12: 0000000000000001
[ 0.012482] R13: ffff880031f82d30 R14: ffff88003186e0c0 R15: 0000000000039130
[ 0.012491] FS: 0000000000000000(0000) GS:ffff880031f5c000(0000) knlGS:0000000000000000
[ 0.012498] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.012504] CR2: 0000000000000000 CR3: 0000000001805000 CR4: 0000000000002660
[ 0.012510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.012516] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.012523] Process swapper/0 (pid: 1, threadinfo ffff880031862000, task ffff880031860040)
[ 0.012529] Stack:
[ 0.012532] ffff88003186e0c0 0000000000031f7b ffffffff81866c80 0000000000000001
[ 0.012544] ffff88003186e0c0 0000000000000001 ffffffff81866c80 0000000000000001
[ 0.012554] ffff880031863e80 ffffffff8143a2e1 ffff880031863e70 0000000000000000
[ 0.012564] Call Trace:
[ 0.012572] [<ffffffff8143a2e1>] xen_cpu_up+0x9b/0x115
[ 0.012579] [<ffffffff81440ad8>] _cpu_up+0x9c/0x10e
[ 0.012585] [<ffffffff81440bbf>] cpu_up+0x75/0x85
[ 0.012593] [<ffffffff818998f1>] smp_init+0x46/0x9e
[ 0.012600] [<ffffffff8188263c>] kernel_init+0x89/0x142
[ 0.012607] [<ffffffff814518b4>] kernel_thread_helper+0x4/0x10
[ 0.012615] [<ffffffff8144f973>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.012624] [<ffffffff81447d7c>] ? retint_restore_args+0x5/0x6
[ 0.012632] [<ffffffff814518b0>] ? gs_change+0x13/0x13
[ 0.012637] Code: 74 0d 48 ba ff ff ff ff ff ff ff 3f 48 21 d0 48 c1 e0 0c 31 ff 49 63 f4 48 89 83 90 13 00 00 48 89 da e8 db 70 bc ff 85 c0 74 04 <0f> 0b eb fe 48 89 df e8 db f6 ce ff 31 c0 48 83 c4 18 5b 41 5c
[ 0.012718] RIP [<ffffffff8143a229>] cpu_initialize_context+0x263/0x280
[ 0.012727] RSP <ffff880031863e10>
[ 0.012738] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.012753] Kernel panic - not syncing: Attempted to kill init!
[ 0.012759] Pid: 1, comm: swapper/0 Tainted: G D 3.2.7-3.pvops.qubes.x86_64 #1
[ 0.012765] Call Trace:
[ 0.012770] [<ffffffff81444c4a>] panic+0x8c/0x1a2
[ 0.012778] [<ffffffff81059814>] ? enqueue_entity+0x74/0x2f0
[ 0.012785] [<ffffffff8106113d>] forget_original_parent+0x34d/0x360
[ 0.012793] [<ffffffff8100a05f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 0.012801] [<ffffffff814478b1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
[ 0.012809] [<ffffffff8104acb3>] ? sched_move_task+0x93/0x150
[ 0.012816] [<ffffffff81061162>] exit_notify+0x12/0x190
[ 0.012822] [<ffffffff81062a3d>] do_exit+0x1ed/0x3e0
[ 0.012828] [<ffffffff814489e6>] oops_end+0xa6/0xf0
[ 0.012833] [<ffffffff81016476>] die+0x56/0x90
[ 0.012837] [<ffffffff81448584>] do_trap+0xc4/0x170
[ 0.012841] [<ffffffff81014440>] do_invalid_op+0x90/0xb0
[ 0.012846] [<ffffffff8143a229>] ? cpu_initialize_context+0x263/0x280
[ 0.012853] [<ffffffff81128ce4>] ? cache_grow.clone.0+0x2b4/0x3b0
[ 0.012857] [<ffffffff8100a05f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 0.012862] [<ffffffff810052f1>] ? pte_mfn_to_pfn+0x71/0xf0
[ 0.012867] [<ffffffff8145172b>] invalid_op+0x1b/0x20
[ 0.012873] [<ffffffff8143a229>] ? cpu_initialize_context+0x263/0x280
[ 0.012880] [<ffffffff8143a2e1>] xen_cpu_up+0x9b/0x115
[ 0.012886] [<ffffffff81440ad8>] _cpu_up+0x9c/0x10e
[ 0.012893] [<ffffffff81440bbf>] cpu_up+0x75/0x85
[ 0.012899] [<ffffffff818998f1>] smp_init+0x46/0x9e
[ 0.012904] [<ffffffff8188263c>] kernel_init+0x89/0x142
[ 0.012911] [<ffffffff814518b4>] kernel_thread_helper+0x4/0x10
[ 0.012918] [<ffffffff8144f973>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.012925] [<ffffffff81447d7c>] ? retint_restore_args+0x5/0x6
[ 0.012932] [<ffffffff814518b0>] ? gs_change+0x13/0x13
Change History (12)
comment:1 Changed 13 months ago by joanna
comment:2 Changed 13 months ago by joanna
The maxmem = mem setting, on the other hand, seems necessary for passthrough PCI devices to work fine in this VM -- see #525...
comment:3 Changed 13 months ago by marmarek
This BUG is a reaction to a failed hypercall. What do you have in xl dmesg (or hypervisor.log)?
Relevant lines from kernel source:
321 if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt))
322     BUG();
323
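The RAX value in the oops hints at why the hypercall failed. Below is a minimal sketch that decodes it, assuming RAX still holds the return value of HYPERVISOR_vcpu_op at the point of the BUG (the `85 c0 74 04 <0f> 0b` bytes in the Code: line, a test of eax followed directly by the trapping instruction, are consistent with that). The helper name `to_signed64` is made up for this illustration.

```python
import errno

def to_signed64(value):
    """Interpret an unsigned 64-bit register value as a signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value

# RAX as printed in the oops above.
rax = 0xfffffffffffffff4

ret = to_signed64(rax)
# Kernel-style error returns are negative errno values.
name = errno.errorcode.get(-ret, "unknown")

print(ret, name)
```

Under that assumption the hypercall returned -12, i.e. -ENOMEM, which would point at the hypervisor running out of memory rather than at a kernel bug as such.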
comment:4 Changed 13 months ago by marmarek
Perhaps this is connected with the problem that the VM sees a different amount of memory than dom0 sets...
I recently found some patches in Konrad's git tree that might fix it (http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commit;h=2e2fb75475c2fc74c98100f1468c8195fee49f3b - perhaps along with the whole selfbaloon branch). Maybe this is the solution? I will try to apply this branch to our kernel.
comment:5 Changed 13 months ago by joanna
But what is singular about this is that it happened suddenly one day, while before it worked fine with mem=maxmem, and *today* it also works fine with mem=maxmem. Kind of a one-time accident that I'm unable to reproduce... Very strange.
comment:6 Changed 13 months ago by joanna
- Resolution set to worksforme
- Status changed from new to closed
I cannot reproduce it. I will close it for now.
comment:7 Changed 12 months ago by joanna
- Component changed from other to kernel-dom0
- Priority changed from major to critical
- Resolution worksforme deleted
- Status changed from closed to reopened
Ok, just got this again:
[ 0.004000] CPU: Physical Processor ID: 0
[ 0.004000] CPU: Processor Core ID: 0
[ 0.004000] SMP alternatives: switching to UP code
[ 0.008081] Performance Events: unsupported p6 CPU model 42 no PMU driver, software events only.
[ 0.008335] installing Xen timer for CPU 1
[ 0.008362] ------------[ cut here ]------------
[ 0.008368] kernel BUG at /home/user/qubes-src/kernel/kernel-3.2.7/linux-3.2.7/arch/x86/xen/smp.c:322!
[ 0.008376] invalid opcode: 0000 [#1] SMP
[ 0.008400] CPU 0
[ 0.008402] Modules linked in:
[ 0.008408]
[ 0.008412] Pid: 1, comm: swapper/0 Not tainted 3.2.7-5.pvops.qubes.x86_64 #1
[ 0.008420] RIP: e030:[<ffffffff8143a229>] [<ffffffff8143a229>] cpu_initialize_context+0x263/0x280
[ 0.008433] RSP: e02b:ffff880018063e10 EFLAGS: 00010282
[ 0.008437] RAX: fffffffffffffff4 RBX: ffff8800180c0000 RCX: 0000000000000000
[ 0.008442] RDX: ffff8800180c0000 RSI: 0000000000000001 RDI: 0000000000000000
[ 0.008447] RBP: ffff880018063e50 R08: 00003ffffffff000 R09: ffff880000000000
[ 0.008452] R10: ffff8800180c0000 R11: 0000000000002000 R12: 0000000000000001
[ 0.008457] R13: ffff880018f82d30 R14: ffff88001806e0c0 R15: 000000000004d0d3
[ 0.008467] FS: 0000000000000000(0000) GS:ffff880018f5c000(0000) knlGS:0000000000000000
[ 0.008474] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.008479] CR2: 0000000000000000 CR3: 0000000001805000 CR4: 0000000000002660
[ 0.008485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.008490] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.008496] Process swapper/0 (pid: 1, threadinfo ffff880018062000, task ffff880018060040)
[ 0.008502] Stack:
[ 0.008505] ffff88001806e0c0 0000000000018f7b ffffffff81866c80 0000000000000001
[ 0.008515] ffff88001806e0c0 0000000000000001 ffffffff81866c80 0000000000000001
[ 0.008525] ffff880018063e80 ffffffff8143a2e1 ffff880018063e70 0000000000000000
[ 0.008535] Call Trace:
[ 0.008542] [<ffffffff8143a2e1>] xen_cpu_up+0x9b/0x115
[ 0.008548] [<ffffffff81440ad8>] _cpu_up+0x9c/0x10e
[ 0.008555] [<ffffffff81440bbf>] cpu_up+0x75/0x85
[ 0.008562] [<ffffffff818998f1>] smp_init+0x46/0x9e
[ 0.008569] [<ffffffff8188263c>] kernel_init+0x89/0x142
[ 0.008577] [<ffffffff814518b4>] kernel_thread_helper+0x4/0x10
[ 0.008585] [<ffffffff8144f973>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.008593] [<ffffffff81447d7c>] ? retint_restore_args+0x5/0x6
[ 0.008600] [<ffffffff814518b0>] ? gs_change+0x13/0x13
[ 0.008604] Code: 74 0d 48 ba ff ff ff ff ff ff ff 3f 48 21 d0 48 c1 e0 0c 31 ff 49 63 f4 48 89 83 90 13 00 00 48 89 da e8 db 70 bc ff 85 c0 74 04 <0f> 0b eb fe 48 89 df e8 db f6 ce ff 31 c0 48 83 c4 18 5b 41 5c
[ 0.008682] RIP [<ffffffff8143a229>] cpu_initialize_context+0x263/0x280
[ 0.008691] RSP <ffff880018063e10>
[ 0.008701] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.008715] Kernel panic - not syncing: Attempted to kill init!
[ 0.008722] Pid: 1, comm: swapper/0 Tainted: G D 3.2.7-5.pvops.qubes.x86_64 #1
[ 0.008728] Call Trace:
[ 0.008734] [<ffffffff81444c4a>] panic+0x8c/0x1a2
[ 0.008742] [<ffffffff81059814>] ? enqueue_entity+0x74/0x2f0
[ 0.008750] [<ffffffff8106113d>] forget_original_parent+0x34d/0x360
[ 0.008758] [<ffffffff8100a05f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 0.008765] [<ffffffff814478b1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
[ 0.008774] [<ffffffff8104acb3>] ? sched_move_task+0x93/0x150
[ 0.008781] [<ffffffff81061162>] exit_notify+0x12/0x190
[ 0.008787] [<ffffffff81062a3d>] do_exit+0x1ed/0x3e0
[ 0.008794] [<ffffffff814489e6>] oops_end+0xa6/0xf0
[ 0.008801] [<ffffffff81016476>] die+0x56/0x90
[ 0.008807] [<ffffffff81448584>] do_trap+0xc4/0x170
[ 0.008813] [<ffffffff81014440>] do_invalid_op+0x90/0xb0
[ 0.008820] [<ffffffff8143a229>] ? cpu_initialize_context+0x263/0x280
[ 0.008829] [<ffffffff81128ce4>] ? cache_grow.clone.0+0x2b4/0x3b0
[ 0.008836] [<ffffffff8100a05f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 0.008843] [<ffffffff810052f1>] ? pte_mfn_to_pfn+0x71/0xf0
[ 0.008851] [<ffffffff8145172b>] invalid_op+0x1b/0x20
[ 0.008857] [<ffffffff8143a229>] ? cpu_initialize_context+0x263/0x280
[ 0.008864] [<ffffffff8143a2e1>] xen_cpu_up+0x9b/0x115
[ 0.008870] [<ffffffff81440ad8>] _cpu_up+0x9c/0x10e
[ 0.008876] [<ffffffff81440bbf>] cpu_up+0x75/0x85
[ 0.008882] [<ffffffff818998f1>] smp_init+0x46/0x9e
[ 0.008888] [<ffffffff8188263c>] kernel_init+0x89/0x142
[ 0.008895] [<ffffffff814518b4>] kernel_thread_helper+0x4/0x10
[ 0.008901] [<ffffffff8144f973>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.008909] [<ffffffff81447d7c>] ? retint_restore_args+0x5/0x6
[ 0.008916] [<ffffffff814518b0>] ? gs_change+0x13/0x13
Nothing in xl dmesg or in Dom0's dmesg. Again, I really changed NOTHING -- it started crashing my VMs suddenly, and keeps occurring no matter what VM I'm starting...
xen-4.1.2-13
dom0 kernel: 3.2.7-6
comment:8 Changed 12 months ago by joanna
- Priority changed from critical to major
- Summary changed from Strange kernel bug upon VM start to Handle runtime Xen out of memory conditions in a user friendly way
- Type changed from defect to enhancement
As discussed here:
http://lists.xen.org/archives/html/xen-devel/2012-06/msg01314.html
... this was really an out of memory condition. As further discussed in the thread, it would be nice to have a generic way to handle such out of memory conditions in Qubes/Xen. Renaming the ticket accordingly...
comment:9 Changed 12 months ago by marmarek
So what do you propose? I don't see any solution for handling such errors in the above thread, only some hints on how to try to mitigate it, and all of them look like much more work than it is worth (like porting part of the XenServer toolstack or digging through the full xen-unstable commit history).
Perhaps we should just increase the Xen free memory threshold in qmemman (currently 50MB) and/or investigate why xen_free_mem=0 happened.
comment:10 Changed 12 months ago by joanna
- Summary changed from Handle runtime Xen out of memory conditions in a user friendly way to Ensure Xen free_memory is always guaranteed to be above a fixed threshold
After some discussion we concluded that the best we can do for now (i.e. before switching to Xen 4.2, which is far on the horizon) is to ensure that Xen free_memory is always guaranteed to be above some threshold. Specifically, we should change the qmemman logic so that it doesn't give any memory to VMs if this breaks the Xen free_memory condition even for a moment (in other words, we should not count on memory being recoverable from a VM).
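A rough sketch of that rule, not the actual qmemman code: the function names, the request format, and the constant below are hypothetical, and only the 50MB figure comes from the earlier comment. The point it illustrates is that grants are budgeted solely from memory that is free in Xen right now, never from memory that might later be reclaimed from another VM.

```python
MIB = 1024 * 1024

# Hypothetical constant; 50MB is the threshold mentioned earlier in this ticket.
XEN_FREE_MEM_MIN = 50 * MIB

def max_grantable(xen_free_bytes, min_free_bytes=XEN_FREE_MEM_MIN):
    """Memory that can be handed out now without dropping below the threshold."""
    return max(0, xen_free_bytes - min_free_bytes)

def plan_grants(xen_free_bytes, requests, min_free_bytes=XEN_FREE_MEM_MIN):
    """Decide how much each VM may receive.

    requests is a list of (vm_name, bytes_wanted) pairs. Only memory that is
    currently free is counted; memory that could perhaps be recovered from
    other VMs is deliberately ignored.
    """
    budget = max_grantable(xen_free_bytes, min_free_bytes)
    grants = {}
    for vm_name, wanted in requests:
        granted = min(wanted, budget)
        grants[vm_name] = granted
        budget -= granted
    return grants

if __name__ == "__main__":
    plan = plan_grants(200 * MIB, [("work", 100 * MIB), ("personal", 100 * MIB)])
    for vm_name, granted in plan.items():
        print(vm_name, granted // MIB, "MiB")
```

With 200 MiB free, the second VM in this example gets only part of what it asked for, so the total handed out never eats into the reserved 50 MiB.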
comment:11 Changed 12 months ago by joanna
- Component changed from kernel-dom0 to core
comment:12 Changed 12 months ago by marmarek
- Resolution set to fixed
- Status changed from reopened to closed

Ok, so this problem seemed to be caused by maxmem set to 800 -- it seems like one of the recent cores modified maxmem automatically for VMs that had mem set to some fixed value (which was the case for this VM, which I wanted to be excluded from dynamic mem balancing -- I assigned it 800MB). After I changed to the following:
mem = 800MB
maxmem = 4037MB
it starts fine again.