#162 closed defect (worksforme)
Improve memory management
| Reported by: | joanna | Owned by: | rafal |
|---|---|---|---|
| Priority: | major | Milestone: | Release 1 Beta 1 |
| Component: | core | Keywords: | |
| Cc: | | | |
Description
- I keep experiencing a situation where Qubes complains it cannot start a new VM because of a lack of memory, despite the fact there is enough memory in the system -- it's just generously spread among the few VMs that have been started previously (and which don't need it for sure).
- When a lot of AppVMs (10+) have been running for a long time and the memory consumption suddenly goes up, the kernel in Dom0 might crash!!! This is likely due to OOM condition.
For the person who will be testing/debugging this -- please run 5-10 AppVMs and use Firefox + Flash in each of them.
Change History (5)
comment:1 Changed 2 years ago by joanna
comment:2 Changed 2 years ago by joanna
- Owner changed from joanna to rafal
- Status changed from new to assigned
comment:3 Changed 2 years ago by rafal
> and which don't need it for sure
If it is repeatable, please provide the output of "free" in each VM, including dom0, as well as the /var/log/qubes/qmemman.log file.
> it's just generously spread
I rather suspect it is done according to the qmemman specification. If so, we may elect to change the specification, but a perfect solution simply does not exist.
> kernel in Dom0 might crash!!! This is likely due to OOM condition.
If this is repeatable, try collecting kernel logs. There can be a plethora of reasons. I will try this as well.
comment:4 Changed 2 years ago by rafal
- Resolution set to worksforme
- Status changed from assigned to closed
Cannot reproduce the crash.
With 4 GB of RAM, I can spawn 8 AppVMs, each playing the same 26-minute YouTube video. I cannot start a 9th one because there is no memory left.
All Flash players have completed successfully. No sign of malfunction in any domain.
Closing until someone can provide better instructions on how to reproduce the crash, or log information that makes it possible to determine whether qmemman misbehaves.
Some log/statistics:
qmemman.log:
dom 17 is below pref, allowing balance
dom 11 act/pref 336355328 319834112.0
dom 10 act/pref 335888384 319392153.6
dom 13 act/pref 335196160 318731878.4
dom 12 act/pref 334426112 318002380.8
dom 15 act/pref 312086528 296756428.8
dom 14 act/pref 322461696 306623283.2
dom 17 act/pref 313569280 318561484.8
dom 16 act/pref 339533824 322858598.4
dom 0 act/pref 1071656960 1019018035.2
xenfree= 56152064 balance req: [('11', 334418582), ('10', 333956471), ('13', 333266087), ('12', 332503324), ('15', 310288554), ('14', 320605338), ('16', 337580986), ('0', 1065485369), ('17', 333087923)]
mem-set domain 11 to 334418582
mem-set domain 10 to 333956471
mem-set domain 13 to 333266087
mem-set domain 12 to 332503324
mem-set domain 15 to 310288554
mem-set domain 14 to 320605338
mem-set domain 16 to 337580986
mem-set domain 0 to 1065485369
mem-set domain 17 to 333087923
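For anyone comparing their own qmemman.log against the excerpt above, the following is a minimal sketch (not part of qmemman) that parses the "act/pref" lines and lists the domains whose actual memory is below their preferred memory. The function names are made up for illustration, and the assumption that both values are in bytes is mine, not something the log states.

```python
import re

# Lines copied verbatim from the qmemman.log excerpt in this comment.
LOG = """\
dom 17 is below pref, allowing balance
dom 11 act/pref 336355328 319834112.0
dom 10 act/pref 335888384 319392153.6
dom 13 act/pref 335196160 318731878.4
dom 12 act/pref 334426112 318002380.8
dom 15 act/pref 312086528 296756428.8
dom 14 act/pref 322461696 306623283.2
dom 17 act/pref 313569280 318561484.8
dom 16 act/pref 339533824 322858598.4
dom 0 act/pref 1071656960 1019018035.2
"""

# Matches e.g. "dom 11 act/pref 336355328 319834112.0".
LINE_RE = re.compile(r"^dom (\d+) act/pref (\d+) ([\d.]+)$")


def parse_act_pref(text):
    """Return {domain_id: (actual, preferred)} for every act/pref line.

    Hypothetical helper; lines in any other format are skipped.
    """
    domains = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            domains[match.group(1)] = (int(match.group(2)), float(match.group(3)))
    return domains


def below_pref(domains):
    """Return the ids of domains whose actual memory is below preferred."""
    return sorted(dom for dom, (act, pref) in domains.items() if act < pref)


domains = parse_act_pref(LOG)
for dom, (act, pref) in sorted(domains.items(), key=lambda kv: int(kv[0])):
    print(f"dom {dom}: act/pref ratio {act / pref:.3f}")
print("below pref:", below_pref(domains))
```

On this excerpt the only domain reported as below pref is 17, which agrees with the first log line.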
xentop, somewhere in the middle of the replay:
xentop - 12:00:59 Xen 3.4.3
10 domains: 3 running, 3 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 3983468k total, 3928636k used, 54832k free CPUs: 4 @ 3192MHz
NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR SSID
Domain-0 -----r 1409 24.7 1040512 26.1 no limit n/a 4 0 0 0 0 0 0 0 0
netvm1 ------ 83 5.0 256000 6.4 409600 10.3 1 0 0 0 3 0 7488 1673 0
test0 -----r 459 29.0 326128 8.2 3983360 100.0 4 0 0 0 3 6 19480 19701 0
test1 -----r 434 22.7 326580 8.2 3983360 100.0 4 0 0 0 3 14 20339 18974 0
test2 --b--- 228 13.0 324708 8.2 3983360 100.0 4 0 0 0 3 0 10154 6303 0
test3 --b--- 366 30.3 325452 8.2 3983360 100.0 4 0 0 0 3 30 18418 17889 0
test4 ------ 334 27.1 313088 7.9 3983360 100.0 4 0 0 0 3 52 23269 16143 0
test5 ------ 307 29.3 303016 7.6 3983360 100.0 4 0 0 0 3 8 24784 15680 0
test6 --b--- 266 30.4 329668 8.3 3983360 100.0 4 0 0 0 3 7 24851 14219 0
test7 ------ 212 31.2 325280 8.2 3983360 100.0 4 0 0 0 3 1 26478 11140 0
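The xentop numbers can be cross-checked with some simple arithmetic. The sketch below only uses figures copied from the snapshot above (in the "k" units xentop reports); the helper name and the "unattributed" label are mine. Reading the unattributed remainder as hypervisor overhead is an assumption, not something the snapshot shows.

```python
# Figures copied from the xentop snapshot above, in xentop's "k" units.
TOTAL_K = 3983468
USED_K = 3928636

DOMAIN_MEM_K = {
    "Domain-0": 1040512,
    "netvm1": 256000,
    "test0": 326128,
    "test1": 326580,
    "test2": 324708,
    "test3": 325452,
    "test4": 313088,
    "test5": 303016,
    "test6": 329668,
    "test7": 325280,
}


def memory_summary(total_k, used_k, domain_mem_k):
    """Summarize how the host memory is split (hypothetical helper)."""
    assigned = sum(domain_mem_k.values())
    return {
        "assigned_k": assigned,               # memory held by all domains
        "free_k": total_k - used_k,           # what is left for a new VM
        "unattributed_k": used_k - assigned,  # used, but not by any domain
    }


summary = memory_summary(TOTAL_K, USED_K, DOMAIN_MEM_K)
for key, value in summary.items():
    print(f"{key}: {value}")
# assigned_k: 3870432
# free_k: 54832
# unattributed_k: 58204
```

The computed free_k matches the "54832k free" line of the snapshot. That is far less than the roughly 300000k to 330000k each test VM holds, which is consistent with the 9th AppVM failing to start.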
comment:5 Changed 2 years ago by rafal
One problem case has been deduced from the logs. Fix at
http://git.qubes-os.org/?p=rafal/core.git;a=commit;h=37e06d19e4339abb3cfd8e17b6c2b05cc73caef8

We should also reconsider running the memory management agent in NetVMs -- with our new approach to handling resume, we no longer need to worry about fragmentation in the NetVM.