Opened 2 years ago
Closed 2 years ago
#146 closed defect (fixed)
Fix suspend netvm
| Reported by: | marmarek | Owned by: | marmarek |
|---|---|---|---|
| Priority: | major | Milestone: | Release 1 Beta 2 |
| Component: | core | Keywords: | |
| Cc: | | | |
Description
After suspend, network driver cannot allocate PCI memory.
Change History (14)
comment:1 Changed 2 years ago by marmarek
- Owner changed from joanna to marmarek
- Status changed from new to assigned
comment:2 Changed 2 years ago by joanna
- Milestone set to Release 1 Beta 1
comment:3 Changed 2 years ago by marmarek
- Status changed from assigned to accepted
Running pm-scripts in netvm fixes the problem.
comment:4 Changed 2 years ago by marmarek
- Resolution set to fixed
- Status changed from accepted to closed
Done.
http://git.qubes-os.org/gitweb/?p=marmarek/core.git;a=commit;h=2bcbc1742ea68dae5b55d7e5cdb3b65a0befae4a
http://git.qubes-os.org/gitweb/?p=marmarek/core.git;a=commit;h=464337a24e1279b99dec2abfe6cd90d69ceeddcf
http://git.qubes-os.org/gitweb/?p=marmarek/core.git;a=commit;h=c2e0a84c222be070449c6d679ed43d0f0f48759e
comment:5 Changed 2 years ago by joanna
- Priority changed from major to critical
- Resolution fixed deleted
- Status changed from closed to reopened
The script doesn't work as expected!
First of all, the qvm-get-default-netvm script, under normal circumstances, would return firewallvm, not the actual hardware-attached netvm!
I think we should also retrieve _all_ currently running netvms and apply the command to all of them, not just the default one (the user might e.g. have two netvms, one for each hardware NIC).
Also, I see the following error in /var/log/pm-suspend.log:
/usr/lib64/pm-utils/sleep.d/01qubes-suspend-netvm resume suspend:
/usr/lib64/pm-utils/sleep.d/01qubes-suspend-netvm: line 19: [: missing `]'
method return sender=:1.46 -> dest=:1.45 reply_serial=2
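For reference, the `[: missing `]'` message in the log above is what the shell prints when the closing bracket is not a separate word. The snippet below is only an illustration of that class of bug, not the actual line 19 of the hook; the variable value and the `result` name are placeholders.

```shell
#!/bin/sh
# Illustration only: NETVM's value and "result" are placeholders,
# not taken from the real 01qubes-suspend-netvm hook.
NETVM="netvm"

# Broken form (kept commented out): without a space before "]", the last
# argument is the single word "netvm]", so "[" never sees its closing
# bracket and fails with: [: missing `]'
# [ "$NETVM" = "netvm"]

# Working form: "]" must be its own, final argument to "[".
if [ "$NETVM" = "netvm" ]; then
    result="match"
else
    result="no match"
fi
echo "$result"
```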
comment:6 Changed 2 years ago by joanna
After I manually hardcoded NETVM=netvm in the pm script, and also manually added the missing whitespace before the closing bracket, the script still doesn't work as expected. Specifically, my netvm consumes some 99% of CPU for about 30 seconds after resume, during which time it writes lots of dramatic messages to its dmesg (MAC in deep sleep! Hardware Error! etc.). After these 30 seconds or so, it reinitializes the NIC and all is fine...
I'm using iwlagn.
comment:7 Changed 2 years ago by joanna
In case you like some fetish:
[ 2410.492733] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.545825] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.593919] iwlagn 0000:00:01.0: Desc Time data1 data2 line [ 2410.593924] iwlagn 0000:00:01.0: ADVANCED SYSASSERT (#-1515870816) 2779096480 0xA5A5A5A0 0xA5A5A5A0 2779096480 [ 2410.593927] iwlagn 0000:00:01.0: blink1 blink2 ilink1 ilink2 [ 2410.593929] iwlagn 0000:00:01.0: 0xA5A5A5A0 0xA5A5A5A0 0xA5A5A5A0 0xA5A5A5A0 [ 2410.593933] iwlagn 0000:00:01.0: CSR values: [ 2410.593935] iwlagn 0000:00:01.0: (2nd byte of CSR_INT_COALESCING is CSR_INT_PERIODIC_REG) [ 2410.593964] iwlagn 0000:00:01.0: CSR_HW_IF_CONFIG_REG: 0X00000000 [ 2410.593989] iwlagn 0000:00:01.0: CSR_INT_COALESCING: 0X0000ff00 [ 2410.594014] iwlagn 0000:00:01.0: CSR_INT: 0X20000000 [ 2410.594039] iwlagn 0000:00:01.0: CSR_INT_MASK: 0X00000000 [ 2410.594064] iwlagn 0000:00:01.0: CSR_FH_INT_STATUS: 0X00000000 [ 2410.594089] iwlagn 0000:00:01.0: CSR_GPIO_IN: 0X0000000f [ 2410.594114] iwlagn 0000:00:01.0: CSR_RESET: 0X00000002 [ 2410.594139] iwlagn 0000:00:01.0: CSR_GP_CNTRL: 0X080003d0 [ 2410.594164] iwlagn 0000:00:01.0: CSR_HW_REV: 0X00000074 [ 2410.594189] iwlagn 0000:00:01.0: CSR_EEPROM_REG: 0X00000000 [ 2410.594214] iwlagn 0000:00:01.0: CSR_EEPROM_GP: 0X90000001 [ 2410.594240] iwlagn 0000:00:01.0: CSR_OTP_GP_REG: 0X00030001 [ 2410.594265] iwlagn 0000:00:01.0: CSR_GIO_REG: 0X00080040 [ 2410.594290] iwlagn 0000:00:01.0: CSR_GP_UCODE_REG: 0X00000000 [ 2410.594315] iwlagn 0000:00:01.0: CSR_GP_DRIVER_REG: 0X00000000 [ 2410.594341] iwlagn 0000:00:01.0: CSR_UCODE_DRV_GP1: 0X00000000 [ 2410.594366] iwlagn 0000:00:01.0: CSR_UCODE_DRV_GP2: 0X00000000 [ 2410.594391] iwlagn 0000:00:01.0: CSR_LED_REG: 0X00000018 [ 2410.594416] iwlagn 0000:00:01.0: CSR_DRAM_INT_TBL_REG: 0X00000000 [ 2410.594442] iwlagn 0000:00:01.0: CSR_GIO_CHICKEN_BITS: 0X27800200 [ 2410.594467] iwlagn 0000:00:01.0: CSR_ANA_PLL_CFG: 0X00000000 [ 2410.594492] iwlagn 0000:00:01.0: 
CSR_HW_REV_WA_REG: 0X0001001a [ 2410.594517] iwlagn 0000:00:01.0: CSR_DBG_HPET_MEM_REG: 0X82000510 [ 2410.594520] iwlagn 0000:00:01.0: FH register values: [ 2410.597907] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.647533] iwlagn 0000:00:01.0: FH_RSCSR_CHNL0_STTS_WPTR_REG: 0Xa5a5a5a0 [ 2410.651522] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.700440] iwlagn 0000:00:01.0: FH_RSCSR_CHNL0_RBDCB_BASE_REG: 0Xa5a5a5a0 [ 2410.704429] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.753333] iwlagn 0000:00:01.0: FH_RSCSR_CHNL0_WPTR: 0Xa5a5a5a0 [ 2410.757320] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.806298] iwlagn 0000:00:01.0: FH_MEM_RCSR_CHNL0_CONFIG_REG: 0Xa5a5a5a0 [ 2410.810286] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.858716] iwlagn 0000:00:01.0: FH_MEM_RSSR_SHARED_CTRL_REG: 0Xa5a5a5a0 [ 2410.862703] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.911470] iwlagn 0000:00:01.0: FH_MEM_RSSR_RX_STATUS_REG: 0Xa5a5a5a0 [ 2410.915459] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2410.964485] iwlagn 0000:00:01.0: FH_MEM_RSSR_RX_ENABLE_ERR_IRQ2DRV: 0Xa5a5a5a0 [ 2410.968475] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2411.017786] iwlagn 0000:00:01.0: FH_TSSR_TX_STATUS_REG: 0Xa5a5a5a0 [ 2411.021772] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2411.070656] iwlagn 0000:00:01.0: FH_TSSR_TX_ERROR_REG: 0Xa5a5a5a0 [ 2411.074644] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2411.127341] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2411.178625] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2411.230143] iwlagn 0000:00:01.0: MAC is in deep sleep!. 
CSR_GP_CNTRL = 0x080003D8 [ 2411.277577] iwlagn 0000:00:01.0: Log capacity -1515870816 is bogus, limit to 512 entries [ 2411.277579] iwlagn 0000:00:01.0: Log write index -1515870816 is bogus, limit to 512 [ 2411.277581] iwlagn 0000:00:01.0: Start IWL Event Log Dump: display last 20 entries [ 2411.281572] iwlagn 0000:00:01.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0x080003D8 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.281572] iwlagn 0000:00:01.0: EVT_LOGT:2779096480:0xa5a5a5a0:2779096480 [ 2411.329946] iwlagn 0000:00:01.0: Hardware error detected. 
Restarting. ...
Again, after about 30 seconds of this dramatic complaining, the interface goes back to normal and is usable... But this happens regardless of the new pm script. Perhaps there is still something wrong with this script?
comment:8 Changed 2 years ago by joanna
Ha! Indeed, after I replaced the mysterious arguments to qvm-run (the command to be executed on the netvm) with plain stupid:
/etc/init.d/NetworkManager stop
and
/etc/init.d/NetworkManager start
Now it works! (Although the NM restart causes ip_forward to be zeroed...)
So, we need a solution that:
1) Actually works! (NM start/stop seems to work, the other one not)
2) Targets all running hardware-attached netvms in the system
3) Takes care of ip_forward in case we decide to restart NM directly.
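A minimal sketch of point 3, assuming the helper runs as root inside the netvm: save the forwarding flag before stopping NM and write it back after starting NM. The function names and the `STATE_FILE` path are hypothetical. The ticket does not say whether the flag is zeroed at stop or at start, so the sketch simply restores it last.

```shell
#!/bin/sh
# Sketch only: function names and STATE_FILE are hypothetical.
IP_FORWARD_FILE="${IP_FORWARD_FILE:-/proc/sys/net/ipv4/ip_forward}"
STATE_FILE="${STATE_FILE:-/var/run/ip_forward.saved}"
NM_INIT="${NM_INIT:-/etc/init.d/NetworkManager}"

# Before suspend: remember the forwarding flag, then stop NetworkManager.
nm_before_suspend() {
    cat "$IP_FORWARD_FILE" > "$STATE_FILE"
    "$NM_INIT" stop
}

# After resume: start NetworkManager, then write the saved flag back.
nm_after_resume() {
    "$NM_INIT" start
    if [ -f "$STATE_FILE" ]; then
        cat "$STATE_FILE" > "$IP_FORWARD_FILE"
        rm -f "$STATE_FILE"
    fi
}
```

If NM touches the flag asynchronously after `start` returns, the restore would need to wait for that; the sketch does not handle this case.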
comment:9 Changed 2 years ago by marmarek
- Resolution set to fixed
- Status changed from reopened to closed
comment:10 Changed 2 years ago by rafal
- Milestone changed from Release 1 Beta 1 to Release 1 Beta 2
- Priority changed from critical to minor
- Resolution fixed deleted
- Status changed from closed to reopened
Unfortunately, it seems that stopping NetworkManager does not guarantee that all interfaces are down. In one problem case the netvm uses eth0, and during pm-suspend NM prints
Apr 11 14:28:48 localhost NetworkManager[1358]: <info> caught signal 15, shutting down normally.
Apr 11 14:28:49 localhost NetworkManager[1358]: <info> (wlan0): now unmanaged
Apr 11 14:28:49 localhost NetworkManager[1358]: <info> (wlan0): device state change: 3 -> 1 (reason 36)
Apr 11 14:28:49 localhost NetworkManager[1358]: <info> (wlan0): cleaning up...
Apr 11 14:28:49 localhost NetworkManager[1358]: <info> (wlan0): taking down device.
Apr 11 14:28:49 localhost NetworkManager[1358]: <info> exiting (success)
Apparently, nothing about eth0. After resume, eth0 does not work; one needs to run ifconfig down, then up, manually.
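The manual workaround described above could look roughly like this inside the netvm, run as root. `cycle_iface` is a hypothetical helper name, and `IFCONFIG` is only there so that the command can be swapped out.

```shell
#!/bin/sh
# Sketch only: cycle_iface is a hypothetical helper, not part of Qubes.
IFCONFIG="${IFCONFIG:-ifconfig}"

# Take the named interface down, then bring it up again.
cycle_iface() {
    "$IFCONFIG" "$1" down && "$IFCONFIG" "$1" up
}
```

For the case in this comment it would be called as `cycle_iface eth0`.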
comment:11 Changed 2 years ago by marmarek
Apparently "service NetworkManager stop" and "service NetworkManager start" are also sufficient to repair eth0.
comment:12 Changed 2 years ago by rafal
- Priority changed from minor to major
Actually, I believe that not downing eth0 may lead to a hang during suspend. I have experienced it once now; it never happened before with the pci-detach mechanism. Raising to major.
comment:13 Changed 2 years ago by marmarek
- Status changed from reopened to accepted
comment:14 Changed 2 years ago by marmarek
- Resolution set to fixed
- Status changed from accepted to closed

Instead of pci-detach, we can bring the interface down before suspend, e.g. stop NetworkManager before suspend (with qvm-run) and start it after resume.
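A rough sketch of such a dom0 pm-utils hook, not the committed script. pm-utils passes `suspend`, `hibernate`, `resume` or `thaw` as the first argument. `NETVMS` and `QVM_RUN` are hypothetical knobs (how the list of running hardware-attached netvms is obtained is not shown), and the `-u root` option to qvm-run is an assumption.

```shell
#!/bin/sh
# Sketch of a sleep.d hook for dom0. NETVMS and QVM_RUN are hypothetical;
# the qvm-run options used here are an assumption, not the real script.
NETVMS="${NETVMS:-netvm}"
QVM_RUN="${QVM_RUN:-qvm-run}"

# Run "service NetworkManager <action>" in every listed netvm.
nm_in_all_netvms() {
    action="$1"
    for vm in $NETVMS; do
        "$QVM_RUN" -u root "$vm" "service NetworkManager $action"
    done
}

# Map the pm-utils event to the NetworkManager action.
handle_pm_event() {
    case "$1" in
        suspend|hibernate) nm_in_all_netvms stop ;;
        resume|thaw)       nm_in_all_netvms start ;;
        *)                 : ;;
    esac
}

handle_pm_event "${1:-}"
```

Looping over a list instead of only the default netvm addresses the earlier objection that qvm-get-default-netvm may return firewallvm.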