#560 closed enhancement (worksforme)

Try some newer kernel for Dom0 and VM

Reported by: wikimaster
Owned by: joanna
Priority: critical
Milestone: Release 1
Component: kernel
Keywords:
Cc:

Description

Problems with 3.2.7:

  • Doesn't support suspend on some systems (e.g. Vaio)
  • Problems with some USB modems in netvms (e.g. on my T420s)
  • udev lockup when booting on T420s that I recently started experiencing... (#551)

Ideally we could have some newer kernel and get rid completely of the old 2.6.38, with all its hundreds of god-knows-written-by-whom patches and unsupported xenlinux architecture...

Change History (28)

comment:1 Changed 12 months ago by marmarek

3.3.5 is in my kernel repo, devel-3.3 branch. There were many changes since 3.2, especially in the ACPI S3 part, but fortunately Konrad provides a branch in his repo with ACPI S3 patches for 3.3. It looks like 3.4 will fully support ACPI S3 OOTB :)

Tested on Samsung laptop and looks ok.

The new nvidia driver (not committed to the dom0-updates repo yet) now compiles cleanly, but still doesn't work - the kernel messages look like some infinite loop in an interrupt handler... The same kernel on bare metal doesn't have this problem. I have not looked at this more deeply yet.

comment:2 Changed 12 months ago by joanna

Just tried the 3.3.5-1 with xen 4.1.2-14, so without this commit:

http://10.141.1.101/?p=marmarek/xen.git;a=commitdiff;h=4bbba736120d67b7d7b7551d881f1f7e50ccbd55

My T420s hung upon the first S3 resume :/

comment:3 Changed 12 months ago by marmarek

This patch (note that a kernel patch is also required) can help here - C-states can be the cause of a freeze at suspend/resume. You can also try to disable cpuidle. I don't remember the exact parameters, but something like this (AFAIR both are required):
xen: no-cpuidle
dom0 kernel: processor.max_cstate=0 intel_idle.max_cstate=0
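
To confirm that the dom0 kernel was really booted with the two parameters above, its command line can be inspected. The following is a minimal sketch, not a tested recipe: the helper names and the sample command line are made up for illustration, and the parameter spellings are taken from the comment as written (its author was not sure they are exact). The Xen option no-cpuidle lives on the hypervisor command line, so it would not show up in /proc/cmdline and is not checked here.

```python
# Parameters as quoted in the comment above; the commenter was not sure they are exact.
DOM0_PARAMS = ["processor.max_cstate=0", "intel_idle.max_cstate=0"]


def missing_params(cmdline, wanted=DOM0_PARAMS):
    """Return the entries of `wanted` that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical command line, used here only for illustration.
sample = "ro root=/dev/mapper/vg-root processor.max_cstate=0"
print(missing_params(sample))
```

On a running dom0, `missing_params(read_cmdline())` would list whichever of the two parameters did not make it onto the kernel command line.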

comment:4 Changed 12 months ago by marmarek

Perhaps this hang was introduced by msi-after-sleep.patch? Have you tried 3.3.5 kernel without this xen patch?

comment:5 Changed 12 months ago by joanna

Interestingly there is one more thing that breaks on this 3.3.5 kernel, compared to the other dom0 kernel -- namely the gui daemon seems to have horrible performance! This is easily visible when one opens up a browser with some long page, and tries to scroll up and down. The graphics lag significantly. When I switch back to the previous dom0 kernel, this problem vanishes... maybe it's connected with how Xen/Dom0 manages the cpu, i.e. it slows it down too much?

comment:6 Changed 12 months ago by marmarek

Interesting, I don't see such symptoms on my system (with nvidia GPU on nouveau driver).
You can check power management parameters and statistics using xenpm tool.

comment:7 Changed 12 months ago by joanna

Later I could also see that with this 3.3.5 kernel in Dom0 some other graphics, specifically the plymouth splash blue screens, are also drawn very slowly. This suggests that there is some problem with supporting Intel Sandy Bridge integrated GPUs (strangely, OpenGL, as used by Dom0 KDE, seems to work fine).

Anyway, a natural question arises: how can we offer a user a choice of several different kernels for Dom0? If we just put them into our current yum repo, then you will automatically always pick the newest one, without offering an easy option to install any of the previous ones.

We could modify the installer so that several selected kernels are installed and grub offers a choice at boot (also for the installer). This would be good for hardware compatibility. Nevertheless, how can we prevent further kernel updates from removing the oldest ones?

comment:8 Changed 12 months ago by marmarek

As for multiple kernels: yum already supports having multiple kernels installed. The number of kernels installed simultaneously can be set via installonly_limit in yum.conf (3 by default).
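
As a rough illustration of where that setting lives, the sketch below reads installonly_limit from the [main] section of a yum.conf-style file. The sample file content and the helper name are assumptions made up for this example; only the option name and its default of 3 come from the comment above.

```python
import configparser

# Hypothetical yum.conf content, for illustration only.
SAMPLE_YUM_CONF = """\
[main]
cachedir=/var/cache/yum
installonly_limit=3
"""


def installonly_limit(conf_text, default=3):
    """Return installonly_limit from yum.conf text, or `default` if it is unset."""
    # Interpolation is disabled so that '%' characters in other options are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser.getint("main", "installonly_limit", fallback=default)


print(installonly_limit(SAMPLE_YUM_CONF))
```

Raising the limit would then be a matter of editing that one line in the real /etc/yum.conf.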

What did you get from xenpm tool?

comment:9 Changed 12 months ago by joanna

Shall we increase the installonly limit to something bigger, say 16? I don't really see a reason to prevent a user from trying a few kernels... Currently, with this being 3, we risk that a user removes a working kernel while trying 3 newer kernels...

comment:10 Changed 12 months ago by marmarek

The reason to keep this limit low is disk space. We have 500MB in /boot (one kernel takes about 20MB), so there should be no problem there. But modules take about 130MB per kernel, which can be significant with a number like 10 (especially when using a not-so-big SSD drive).
IMHO we can increase the default to something like 5, but not more. If the user wants to experiment with different kernel versions, they can increase this limit manually.
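
The trade-off can be made concrete with a little arithmetic. This sketch uses only the approximate figures quoted above (about 20MB per kernel in a 500MB /boot, about 130MB of modules per kernel); the function name is made up, and real sizes will vary by build.

```python
# Approximate figures quoted in the comment above.
BOOT_MB_PER_KERNEL = 20
MODULES_MB_PER_KERNEL = 130
BOOT_PARTITION_MB = 500


def kernel_footprint_mb(limit):
    """Estimate disk usage when `limit` kernels are installed side by side."""
    return {
        "boot": limit * BOOT_MB_PER_KERNEL,
        "modules": limit * MODULES_MB_PER_KERNEL,
    }


for limit in (3, 5, 10, 16):
    usage = kernel_footprint_mb(limit)
    print(
        f"installonly_limit={limit}: "
        f"/boot ~{usage['boot']}MB of {BOOT_PARTITION_MB}MB, "
        f"modules ~{usage['modules']}MB"
    )
```

With these numbers /boot stays well within 500MB even at a limit of 16 (about 320MB), while modules grow from roughly 650MB at a limit of 5 to roughly 2GB at 16.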

BTW I've just pushed the devel-3.4 branch with a brand new 3.4 kernel (+ some patches - ACPI S3 under Xen still isn't upstream). I had some problems in a VM on this kernel ("null paging request" while loading iptables modules), but cannot reproduce them... Besides that it looks good (especially in dom0). Maybe it fixes some of the problems mentioned in the original ticket description and/or the Intel GPU performance issue?

comment:11 Changed 12 months ago by joanna

I just tried the 3.4 kernel (+ Xen with "xen: allow dom0 to update C-/P-/T- state management info" patch). The good news is that suspend seems to work fine now on my T420s. The bad news is that the graphics performance is just as poor as on the 3.3 kernel, as reported above. I played with xenpm and made sure to set the following:
1) xenpm set-scaling-governor performance
2) xenpm set-max-cstate 0

I have then verified that my processor: 1) stays in the P0 state (so, max frequency), and 2) stays in the C0 state.
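
For repeated testing, the two xenpm invocations listed above could be wrapped in a small script. This is only a sketch: the function name and the injectable `run` argument are illustrative choices, the commands are exactly the two quoted in this comment, and actually running them presumably requires root in dom0.

```python
import subprocess

# The two commands quoted in the comment above, as argument lists.
XENPM_COMMANDS = [
    ["xenpm", "set-scaling-governor", "performance"],
    ["xenpm", "set-max-cstate", "0"],
]


def apply_settings(run=subprocess.check_call):
    """Run each xenpm command in order; `run` can be swapped out for a dry run."""
    for command in XENPM_COMMANDS:
        run(command)


# Dry run: print the commands instead of executing them.
apply_settings(run=lambda command: print(" ".join(command)))
```

Calling `apply_settings()` with no argument would execute the commands and raise an exception if xenpm exits with an error.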

Those settings didn't change anything regarding the graphics performance, which is really visibly poor (e.g. when scrolling large websites in a browser).

Is there any other setting in xenpm I could try?

comment:12 Changed 12 months ago by joanna

Ah, one additional side effect that seems to be related to running 3.4 as dom0 kernel (or perhaps xen with this C/P/T patch?) is that I keep getting the following crash in various VMs (which run 3.2.7-4 kernel):

[   21.018929] BUG: unable to handle kernel paging request at 00000000988c9400
[   21.018943] IP: [<ffffffffa00b3e82>] ____nf_conntrack_find+0x2/0x180 [nf_conntrack]
[   21.018958] PGD 11a79067 PUD 0 
[   21.018964] Oops: 0002 [#1] SMP 
[   21.018970] CPU 3 
[   21.018973] Modules linked in: bnep bluetooth rfkill ipt_REJECT xt_state xt_tcpudp iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables xen_netfront pcspkr u2mfn(O) xen_blkback xen_evtchn autofs4 ext4 jbd2 crc16 dm_snapshot xen_blkfront [last unloaded: scsi_wait_scan]
[   21.019017] 
[   21.019021] Pid: 1006, comm: firefox Tainted: G           O 3.2.7-4.pvops.qubes.x86_64 #1  
[   21.019029] RIP: e030:[<ffffffffa00b3e82>]  [<ffffffffa00b3e82>] ____nf_conntrack_find+0x2/0x180 [nf_conntrack]
[   21.019041] RSP: e02b:ffff8800114878c0  EFLAGS: 00010282
[   21.019045] RAX: 00000000988c9400 RBX: ffff880011487938 RCX: 00000000988c94c1
[   21.019051] RDX: ffff880011487938 RSI: 0000000000000000 RDI: ffffffff819e6ec0
[   21.019057] RBP: ffff8800114878f8 R08: 0000000020a9a821 R09: 000000007fad0e16
[   21.019063] R10: ffff880011487950 R11: 0000000000000000 R12: 0000000000000000
[   21.019069] R13: 00000000988c94c1 R14: ffffffff819e6ec0 R15: 0000000000000000
[   21.019079] FS:  00007fe0784f6700(0000) GS:ffff880018fad000(0000) knlGS:0000000000000000
[   21.019086] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[   21.019092] CR2: 00000000988c9400 CR3: 0000000010a55000 CR4: 0000000000002660
[   21.019098] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   21.019104] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   21.019111] Process firefox (pid: 1006, threadinfo ffff880011486000, task ffff8800114be440)
[   21.019117] Stack:
[   21.019121]  ffffffffa00b403d ffff8800114878f8 ffff8800118382c0 0000000000000000
[   21.019132]  ffffffff819e6ec0 0000000000000000 0000000000000002 ffff8800114879a8
[   21.019142]  ffffffffa00b44a2 ffffffffa00d1420 ffffffffa00be640 ffffffffa00be640
[   21.019153] Call Trace:
[   21.019162]  [<ffffffffa00b403d>] ? __nf_conntrack_find_get+0x3d/0x180 [nf_conntrack]
[   21.019173]  [<ffffffffa00b44a2>] nf_conntrack_in+0x2d2/0x6a0 [nf_conntrack]
[   21.019183]  [<ffffffff8139d6e8>] ? ip_generic_getfrag+0x88/0xa0
[   21.019191]  [<ffffffffa00d0689>] ipv4_conntrack_local+0x49/0x50 [nf_conntrack_ipv4]
[   21.019200]  [<ffffffff813910d5>] nf_iterate+0x85/0xb0
[   21.019207]  [<ffffffff8139be70>] ? ip_options_build+0x210/0x210
[   21.019214]  [<ffffffff8139125d>] nf_hook_slow+0x6d/0x140
[   21.019221]  [<ffffffff8139be70>] ? ip_options_build+0x210/0x210
[   21.019228]  [<ffffffff8139e0fe>] __ip_local_out+0x9e/0xa0
[   21.019234]  [<ffffffff8139e431>] ip_local_out+0x11/0x30
[   21.019240]  [<ffffffff8139e466>] ip_send_skb+0x16/0x50
[   21.019248]  [<ffffffff813bfe98>] udp_send_skb+0x108/0x390
[   21.019255]  [<ffffffff81396129>] ? ipv4_dst_check+0x39/0x40
[   21.019262]  [<ffffffff8139d660>] ? ip_append_page+0x530/0x530
[   21.019270]  [<ffffffff813c107d>] udp_sendmsg+0x2ed/0x8a0
[   21.019278]  [<ffffffff8100a05f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[   21.019288]  [<ffffffff81128997>] ? kmem_cache_alloc+0x77/0x110
[   21.019296]  [<ffffffff813ca0b3>] inet_sendmsg+0x43/0xb0
[   21.019303]  [<ffffffff81352884>] sock_sendmsg+0xe4/0x110
[   21.019309]  [<ffffffff81396d8e>] ? ip_route_output_slow+0x1ae/0x520
[   21.019321]  [<ffffffff81065462>] ? local_bh_enable_ip+0x22/0xa0
[   21.019331]  [<ffffffff81447a80>] ? _raw_spin_unlock_bh+0x10/0x20
[   21.019338]  [<ffffffff813561fb>] ? release_sock+0xdb/0x110
[   21.019345]  [<ffffffff81352a04>] sys_sendto+0x104/0x140
[   21.019356]  [<ffffffff8103a2a8>] ? pvclock_clocksource_read+0x58/0xd0
[   21.019362]  [<ffffffff81009e60>] ? xen_clocksource_read+0x20/0x30
[   21.019369]  [<ffffffff81009ff9>] ? xen_clocksource_get_cycles+0x9/0x10
[   21.019378]  [<ffffffff81088a52>] ? getnstimeofday+0x52/0xe0
[   21.019386]  [<ffffffff8144f752>] system_call_fastpath+0x16/0x1b
[   21.019391] Code: 00 01 00 00 09 bc 0b a0 00 00 00 50 e8 0b a0 00 00 a8 43 61 18 e1 00 00 00 01 00 00 00 02 00 00 00 40 06 00 00 00 00 00 00 b0 00 <00> 00 00 00 00 00 01 00 00 00 02 00 00 00 30 05 00 00 00 00 00 
[   21.019473] RIP  [<ffffffffa00b3e82>] ____nf_conntrack_find+0x2/0x180 [nf_conntrack]
[   21.019487]  RSP <ffff8800114878c0>
[   21.019491] CR2: 00000000988c9400
[   21.019496] ---[ end trace 24141b02a27a02d2 ]---

comment:13 Changed 12 months ago by marmarek

The crash could be caused by the latest SKB slots patch... I don't have any other idea how a change in the dom0 kernel can cause VM kernel crashes, especially in the network subsystem, which are totally isolated.

comment:15 Changed 12 months ago by joanna

Finally I managed to reproduce the hang on my 3.2.7-3 Dom0 kernel:

Key slot 0 unlocked.
                Welcome to Qubes
                Press 'I' to enter interactive startup.
Starting udev: udevd-work[816]: '/usr/bin/vmmouse_detect' unexpected exit with status 0x000b

udevd-work[810]: '/usr/bin/vmmouse_detect' unexpected exit with status 0x000b

udevd[802]: worker [850] unexpectedly returned with status 0x0100

udevd[802]: worker [850] failed while handling '/devices/virtual/block/loop114'

udevd[802]: worker [1421] unexpectedly returned with status 0x0100

udevd[802]: worker [1421] failed while handling '/devices/virtual/block/loop192'

udevd[802]: worker [809] unexpectedly returned with status 0x0100

udevd[802]: worker [809] failed while handling '/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1'

udevd[802]: worker [838] unexpectedly returned with status 0x0100

udevd[802]: worker [838] failed while handling '/devices/virtual/block/loop201'

udevd[802]: worker [872] unexpectedly returned with status 0x0100

udevd[802]: worker [872] failed while handling '/devices/virtual/block/loop205'

udevd[802]: worker [875] unexpectedly returned with status 0x0100

udevd[802]: worker [875] failed while handling '/devices/virtual/block/loop206'

udevd[802]: worker [876] unexpectedly returned with status 0x0100

udevd[802]: worker [876] failed while handling '/devices/virtual/block/loop207'

udevd[802]: worker [885] unexpectedly returned with status 0x0100

udevd[802]: worker [885] failed while handling '/devices/virtual/block/loop209'

udevd[802]: worker [1058] unexpectedly returned with status 0x0100

udevd[802]: worker [1058] failed while handling '/devices/virtual/block/loop212'

udevd[802]: worker [1158] unexpectedly returned with status 0x0100

udevd[802]: worker [1158] failed while handling '/devices/virtual/block/loop217'

comment:16 Changed 12 months ago by marmarek

Perhaps the reason is the slowness of xenstore. Our udev handler exposes block devices to xenstore, but also removes them when it decides not to export some device - regardless of whether it was exported earlier or not. In dom0 we have 256 loop devices, which are handled simultaneously by udev, which produces at least that number of simultaneous xenstore-rm calls (empty loop devices are hidden from qvm-block). Maybe this is even some deadlock in xenstore...

I've just modified the udev handler to save the information whether a device is exported to xenstore and to call xenstore-rm only when necessary (will push it in the near future).
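
The idea of remembering export state can be sketched as follows. This is not the actual handler from the commit, which is not shown in this ticket: the function name, the marker-file scheme, and the `xs_write`/`xs_rm` callables (standing in for the real xenstore write and xenstore-rm calls) are all assumptions made for illustration.

```python
import os
import tempfile


def handle_device(name, should_export, state_dir, xs_write, xs_rm):
    """Export a device, or remove it from xenstore only if it was exported before."""
    marker = os.path.join(state_dir, name)
    if should_export:
        xs_write(name)
        # Remember that this device now has an entry in xenstore.
        open(marker, "w").close()
    elif os.path.exists(marker):
        xs_rm(name)
        os.remove(marker)
    # Otherwise the device was never exported, so no removal call is issued.


# Illustration with stand-in callables that just record what would be called.
written, removed = [], []
with tempfile.TemporaryDirectory() as state_dir:
    handle_device("loop114", False, state_dir, written.append, removed.append)
    handle_device("sda1", True, state_dir, written.append, removed.append)
    handle_device("sda1", False, state_dir, written.append, removed.append)
print(written, removed)
```

In this sketch an empty loop device that was never exported triggers no removal at all, which is the reduction in simultaneous xenstore-rm calls the comment is after.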

comment:17 Changed 11 months ago by joanna

Have you pushed?

comment:18 Changed 11 months ago by marmarek

Into gitpro - yes. Not to public git yet - will do it later today.

comment:19 Changed 11 months ago by marmarek

The above-mentioned commit: http://git.qubes-os.org/?p=marmarek/core.git;a=commit;h=3a8427cee57cab2a0f10c00586a8ccd967462aa5

Also pushed 3.4.2 to the devel-3.4 branch. 3.4 up to 3.4.1 had a rather strange problem which manifests itself in messed-up pages read from a loop device. On my test system it causes crashes during netvm boot because broken modules get loaded (reads from /lib/modules returned messed-up file contents!). Details: http://lists.xen.org/archives/html/xen-devel/2012-06/msg00537.html

But still, I suspect nothing new here in terms of intel gpu performance.

comment:20 Changed 11 months ago by joanna

Sadly, this commit didn't help with udev on my kernel. It still tends to hang on udev during boot for a few minutes...

comment:21 Changed 11 months ago by joanna

  • Priority changed from major to critical

comment:22 Changed 11 months ago by marmarek

Can you catch which udev action causes this (and whether it really is a udev action)? Perhaps some sysrq will help, or some additional detail in the logfile. The udev messages above didn't contain info on which action failed...
Did you try 3.4.2?
Regarding GPU performance, I can revert the commit pointed out by Radoslaw Szkodzinski, but can't predict the side effects... Maybe it's worth a try?

comment:23 Changed 11 months ago by marmarek

I've pushed 3.4.4 (which contains some minor fixes) and also added a fix for GPU performance (reverted the commit pointed out by R.Szkodzinski and applied the correct fix from Konrad's git repo).
In dom0 running this kernel I constantly see load>=1.0, but no process consumes CPU time... Besides that everything seems to be working.

comment:24 Changed 11 months ago by joanna

The GPU performance seems ok on the 3.4.4 kernel :)

comment:25 Changed 11 months ago by joanna

Regarding the 3.4.4 kernel running in a VM -- I constantly observe my WiFi card getting stuck and becoming effectively defunct in the netvm:

[10658.740087] iwlwifi 0000:00:02.0: Queue 12 stuck for 10000 ms.
[10658.740099] iwlwifi 0000:00:02.0: Current SW read_ptr 58 write_ptr 154
[10658.740175] iwlwifi 0000:00:02.0: Current HW read_ptr 58 write_ptr 154
[10658.740185] iwlwifi 0000:00:02.0: On demand firmware reload
[10658.740771] ieee80211 phy0: Hardware restart was requested
[10658.740958] iwlwifi 0000:00:02.0: L1 Enabled; Disabling L0S
[10658.741157] iwlwifi 0000:00:02.0: Radio type=0x1-0x2-0x0

This is seen by the user as networking not working anymore. The solution is to click disconnect and then connect again in the NM applet.

comment:26 Changed 11 months ago by joanna

It seems like 3.4.4 works fine as a Dom0 kernel on my laptop...

comment:27 Changed 11 months ago by joanna

... actually, the 3.4.4 has one problem when I run it as Dom0 on my laptop -- namely, quite often, the kscreenlocker or kwin becomes somehow defunct after S3 resume, and I need to manually go to a text console, log in, and then kill the kwin, switch to X, and restart the kwin in order to be able to continue the work under X. It might be something GPU-related, perhaps.

In any case, I think we should choose 3.2.7-7 as both the Dom0 and the default VM kernel. The 3.2.7 branch has got the most testing with Qubes so far...

comment:28 Changed 10 months ago by joanna

  • Resolution set to worksforme
  • Status changed from new to closed

If we all agree, I will close this ticket.
