Suspend hangs after GPU is used. How to troubleshoot?

Super User Asked by Pandian Le on November 18, 2020

Problem

Normally, when only Xorg and Compiz are running on my GPU, I can
suspend without trouble. However, if I run an intensive PyTorch
training job (about 90% GPU in use) from Jupyter and then suspend
after the processes are over, the laptop fails to sleep or wake up.

I am positive that the GPU being full (or at least not empty) is
causing the issue. I don’t know why "some process", possibly related
to the GPU, is not suspending. If I start Jupyter, run 1+1 (or some
other trivial computation) and then suspend, there are no issues.
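
One way to check the "GPU not empty" suspicion before suspending is to ask the driver which processes still hold GPU memory. The sketch below is only an illustration. It assumes `nvidia-smi` is on the PATH and that the installed driver supports the `--query-compute-apps` query; the function names `parse_compute_apps` and `list_gpu_compute_apps` are made up for this example.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse the CSV printed by
    nvidia-smi --query-compute-apps=pid,process_name,used_memory
               --format=csv,noheader,nounits
    into a list of dicts. Memory is in MiB, or None if not reported."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from both ends so a comma inside the process name survives.
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        try:
            used_mib = int(mem.strip())
        except ValueError:
            used_mib = None  # e.g. "[N/A]" when the driver does not report it
        apps.append({"pid": int(pid), "name": name.strip(), "used_mib": used_mib})
    return apps


def list_gpu_compute_apps():
    """Run nvidia-smi and return the processes that currently use the GPU
    for compute. Assumes nvidia-smi is installed and on the PATH."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling `list_gpu_compute_apps()` after training has finished and just before suspending shows whether anything is still attached to the GPU. An idle Jupyter kernel that ran PyTorch may still be listed, since it typically keeps its CUDA context until the kernel is shut down. An empty list would suggest the hang is not caused by a leftover compute process.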

Question

kern.log shows me nothing "fishy". I have tried a bunch of online
remedies and am now at a dead end.

How do I identify what is happening? Any ideas?

Other symptoms

It sort of goes to sleep, but I still hear some sound from the laptop
when I hit a key (it sounds as if it is booting up), followed by a
blank screen. Sometimes I can get to a TTY but can’t type anything.


My system

  • Ubuntu 16.04
  • Nvidia GeForce 1050
  • Acer Nitro 5, 8 GB RAM

What have I tried to rectify this issue?

I spent a good five full days reading, searching, re-installing,
etc., and am now at a dead end.

  1. Checked the kernel logs (pastebin link) but didn’t see anything
    "fishy" (at 02:08 I start the suspend and at 10:21 I do a hard
    reset).

    Here is a tiny excerpt:

Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.6443] manager: sleep requested (sleeping: no  enabled: yes)
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.6443] manager: sleeping...
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.6447] manager: NetworkManager state is now ASLEEP
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.6453] device (wlp2s0): state change: activated -> deactivating (reason 'sleeping') [100 110 37]
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.8169] device (wlp2s0): state change: deactivating -> disconnected (reason 'sleeping') [110 30 37]
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.8356] dhcp4 (wlp2s0): canceled DHCP transaction, DHCP client pid 8328
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.8356] dhcp4 (wlp2s0): state changed bound -> done
Oct  2 02:08:06 eghx-nitro NetworkManager[8152]: <info>  [1601597286.8363] dns-mgr: Writing DNS information to /sbin/resolvconf
Oct  2 02:08:06 eghx-nitro kernel: [24100.153393] wlp2s0: deauthenticating from e8:cc:18:41:3c:15 by local choice (Reason: 3=DEAUTH_LEAVING)
Oct  2 02:08:07 eghx-nitro NetworkManager[8152]: <warn>  [1601597287.0509] sup-iface[0xb4a6f0,wlp2s0]: connection disconnected (reason -3)
Oct  2 02:08:07 eghx-nitro NetworkManager[8152]: <info>  [1601597287.0511] device (wlp2s0): supplicant interface state: completed -> disconnected
Oct  2 02:08:07 eghx-nitro NetworkManager[8152]: <info>  [1601597287.0525] device (wlp2s0): state change: disconnected -> unmanaged (reason 'sleeping') [30 10 37]
Oct  2 02:08:08 eghx-nitro kernel: [24101.983885] PM: suspend entry (deep)
Oct  2 02:08:09 eghx-nitro kernel: [24101.983888] PM: Syncing filesystems ... done.
Oct  2 10:21:32 eghx-nitro kernel: [24103.953554] Freezing user space processes ... (elapsed 0.002 seconds) done.
  1. Based on a suggestion from the Nvidia forum, added the following
    to the GRUB configuration and updated it.

     GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi_rev_override=1 acpi_osi=Linux scsi_mod.use_blk_mq=1 nouveau.modeset=0 nouveau.runpm=0 mem_sleep_default=deep"
    

    Added the following to initramfs-tools/modules and updated.

     nvidia
     nvidia_modeset
     nvidia_uvm
     nvidia_drm
    
  2. Didn’t change the kernel, as there was no evidence pointing to
    it. Some people moved to 4.17; mine is currently 4.15.

  3. Blind try: tried different suspend commands.

     systemctl suspend
    
     pm-suspend
    
  4. Tried downgrading the driver from 430 to 384 via Additional
    Drivers. This did not help, as 384 could not coexist with
    pytorch=1.6.0.

  5. Completely removed and re-installed nvidia-430 as per here:
    purge, add-apt-repository ppa:graphics-drivers/ppa, update
    and autoinstall.

    This ended in the black screen of death. Recovered from it with
    nouveau.modeset=0, but somehow the GPU was not working anymore.

  6. At this point did a complete re-install of xserver, unity,
    lightdm and nvidia-430 from a TTY before the login screen.

    This restored the system to its previous state, i.e., suspending
    when the GPU is full hangs the system.
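
Since reading the kernel log by eye turned up nothing, a small script can at least pin down which suspend stage was logged last before the machine went quiet. The sketch below is a rough illustration, not a definitive diagnostic. It assumes syslog-style lines like the excerpt above; `find_stalls`, `parse_kernel_lines`, the 60-second threshold and the default year are all hypothetical choices made for this example. Syslog timestamps carry no year, so one has to be supplied, and a log that crosses New Year is not handled.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches e.g.
# "Oct  2 02:08:08 eghx-nitro kernel: [24101.983885] PM: suspend entry (deep)"
KERNEL_LINE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<msg>.*)$"
)


def parse_kernel_lines(lines, year=2020):
    """Yield (wall_clock, uptime_seconds, message) for each kernel line.
    Non-kernel lines (NetworkManager etc.) are skipped."""
    for line in lines:
        match = KERNEL_LINE.match(line.rstrip("\n"))
        if match is None or match.group("mon") not in MONTHS:
            continue
        wall_clock = datetime(
            year, MONTHS[match.group("mon")], int(match.group("day")),
            int(match.group("h")), int(match.group("m")), int(match.group("s")))
        yield wall_clock, float(match.group("uptime")), match.group("msg")


def find_stalls(lines, min_gap_seconds=60, year=2020):
    """Return a list of (message_before_gap, message_after_gap, gap_seconds)
    for consecutive kernel lines whose wall-clock timestamps differ by at
    least min_gap_seconds."""
    stalls = []
    previous = None
    for entry in parse_kernel_lines(lines, year):
        if previous is not None:
            gap = (entry[0] - previous[0]).total_seconds()
            if gap >= min_gap_seconds:
                stalls.append((previous[2], entry[2], gap))
        previous = entry
    return stalls
```

It could be fed the log directly, for example `find_stalls(open("/var/log/kern.log"))`. For the excerpt above it would report the gap between "PM: Syncing filesystems ... done." and "Freezing user space processes ...", which at least narrows down where the suspend sequence stopped being logged.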
