Hobbyist Toolkit


Improving deployed executable startup/reboot/longevity reliability

Solved!

Hi fellow wireworkers,

I have successfully run my deployed LV startup executable on the RPi Zero 2 W continuously for many days (at least 9, to be exact).  I now want to solve two problems:

 

  1. Startup after a cold start or reboot of the RPi is not consistent: about 30% of the time the executable fails to run, sometimes failing to launch 3-4 reboots in a row.  My reboot method is either via the RPi GUI (shutdown/reboot), via a forced watchdog reboot, or sometimes removing/replugging the USB cable to power-cycle (my app is not doing any SD card writes at the time).  Restarting the LabVIEW service via the LV IDE \Utilities\Restart menu I *think* gives more reliable restarts (I still need to quantify how reliable, though).  Peter_B_0-1703913458537.png

     

  2. Automatically recovering from a lock-up of the startup executable after many days of stable running.

 

On point 1., is that a common situation that other developers here see?  Any thoughts on how I could improve the reliability?  I need it to recover automatically, i.e. if a reboot fails to launch the startup EXE, I must keep auto-rebooting until it launches OK.
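For what it's worth, a lighter-weight fallback than auto-rebooting the whole Pi (a sketch only, assuming the failure shows up as labview.service exiting with an error; the 10 second value is illustrative) would be a systemd drop-in that keeps retrying the service:

sudo systemctl edit labview.service

# then add, in the editor that opens:

[Unit]
# never give up after a burst of rapid failures
StartLimitIntervalSec=0

[Service]
# retry whenever the service fails to start or dies
Restart=on-failure
RestartSec=10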

 

For 2., I am pursuing two possible ideas:

 

a) Use the RPi's internal hardware watchdog.  I have succeeded in installing and enabling the watchdog per many other RPi-centric articles, e.g. this.  My next step is for my LV exe to disable the wd_keepalive daemon service (using the SSH trick and sudo commands, etc.) and then have the exe pat/tickle the watchdog every 14 seconds; a rough sketch of that hand-over follows.
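A minimal sketch of the hand-over, assuming the stock daemon is named wd_keepalive.service and that whatever pats the watchdog runs as root (any write to /dev/watchdog resets the hardware timer; 14 s stays inside the BCM chip's roughly 15 s maximum timeout):

sudo systemctl stop wd_keepalive.service
sudo systemctl disable wd_keepalive.service

# pat the watchdog every 14 s; if this loop (or the app driving it) dies,
# the hardware reboots the Pi once the timeout expires
while true; do
    echo . > /dev/watchdog
    sleep 14
done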

 

b) Alternatively, write my own watchdog (a BASH script, or even a second LV exe running in parallel at startup, if possible) that just restarts the LabVIEW service if the main app doesn't tickle it.  It would use the Linux command that achieves the same as the \Utilities\Restart I mentioned above:

sudo systemctl restart labview.service
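A rough sketch of such a script, assuming the main app periodically touches a heartbeat file (the path and the 60 s threshold here are invented for illustration; since the exe runs inside the chroot, the host-side path would really live under /srv/chroot/labview/):

#!/bin/bash
# restart labview.service if the app's heartbeat file goes stale
HEARTBEAT=/tmp/lv_heartbeat    # hypothetical file the LV exe touches
MAX_AGE=60                     # seconds of silence before we intervene
while true; do
    now=$(date +%s)
    mtime=$(stat -c %Y "$HEARTBEAT" 2>/dev/null || echo 0)
    if (( now - mtime > MAX_AGE )); then
        sudo systemctl restart labview.service
        sleep 30               # give the app time to start tickling again
    fi
    sleep 10
done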

 

Any thoughts on a) or b)?

thanks

Peter
Message 1 of 12

Hi Peter,

Is your application using the network interface?

On power-up, is it possible that your start-up application is running before the Raspberry Pi has started the network device and obtained an IP address?

There is an option in raspi-config to wait for network at boot. It may be worth enabling this.
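If you would rather tie the wait to the LabVIEW service itself, a drop-in along these lines should also work (an untested sketch; network-online.target is only meaningful when a network wait service is enabled, which is what the raspi-config option arranges):

sudo systemctl edit labview.service

[Unit]
# hold the LabVIEW chroot daemon until the network is considered up
Wants=network-online.target
After=network-online.target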

 

Check that all the inputs to your main vi have default values assigned.

Message 2 of 12

FWIW, I have located and attached LabVIEW's failure log, which lives at /srv/chroot/labview/var/local/natinst/log/LabVIEW_Failure_Log.root.txt

 

The failure description is always "LabVIEW caught fatal signal", and the two possible failure reasons are:

 

####
#Date: Thu, Dec 28, 2023 08:33:45 AM
#Desc: LabVIEW caught fatal signal
21.0 - Received SIGSEGV
Reason: invalid permissions for mapped object

....

####
#Date: Fri, Dec 29, 2023 06:14:41 AM
#Desc: LabVIEW caught fatal signal
Reason: address not mapped to object

Peter
Message 3 of 12

Hi Andy, thanks for replying!  I sent my other reply, with the log file attached, before I realised you had replied.

 

I'm assuming the failure-to-run-after-restart behaviour I sometimes see is not expected, or at least is unusual.

 

To answer your questions:

>Is your application using the network interface?

No, it isn't.  The main things it does before the RPi has fully booted are to:

  1. create and write to text files at /srv/chroot/labview/usr/local/*.txt
  2. set some GPIO lines
  3. poll the USB serial port for data

 

>Check that all the inputs to your main vi have default values assigned.

OK, that is a suggestion out of left field!  The Main vi only has one input, per below, called Module Admin.  Don't default values exist for all controls and indicators?

 

Peter_B_0-1703935548469.png 

Peter_B_1-1703935647622.png

 

 

Peter
Message 4 of 12

I would definitely try to wait for full startup before doing much serious work.  The GPIO access especially looks suspicious.  Before the host OS is fully started, the mapping of the device entries in the chroot may actually point to non-existent device entries on the host.  The file system could also be affected, but I would expect that to be a bit more robust and not simply cause a fatal SIGSEGV if it is not yet fully available.

 

Since it works much of the time, the needed delay may be relatively short and could just be hardwired, but you could also try the new VI to check for network availability, which has been present since 2019 or so.
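If you want something more deterministic than a fixed delay, one option (a sketch, assuming your exe can shell out to the host, e.g. via the SSH trick mentioned above) is to poll systemd's overall state until boot has finished:

# block until systemd reports the boot is complete
# ('degraded' means booted but with some failed unit - still usable here)
while true; do
    state=$(systemctl is-system-running 2>/dev/null)
    if [ "$state" = "running" ] || [ "$state" = "degraded" ]; then
        break
    fi
    sleep 1
done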

Rolf Kalbermatter
Message 5 of 12

Thank you for chiming in, Rolf!

 

From your reply it does seem that even a 30% restart failure rate is unexpected.  I note the executable runs (and thus sets GPIO lines) at about the 20 s mark post restart, about 5-10 seconds before the RPi's desktop appears on its display port.  So I will begin by inserting a very large delay of, say, 30 to 60 seconds, just to see if it solves the problem.  If it does, I can work backwards from there to better time when to give my exe the freedom it wants (e.g. using the network availability VI, or by inspecting other O/S status indicators yet to be elucidated).
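Rather than delaying inside the app itself, I may try the delay as a systemd drop-in (a sketch; note that an ExecStartPre added this way is appended after the existing chroot-setup step, so it delays the launch of the exe, and 30 s is just my first guess):

sudo systemctl edit labview.service

[Service]
# crude fixed delay before the run-time daemon (and my exe) starts
ExecStartPre=/bin/sleep 30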

 

I may revisit my watchdog question once I have solved the reboot issue, given that the latter must be reliable before I implement the former.

 

 

Peter
Message 6 of 12

The error is a SEGMENTATION FAULT!  That is what causes Linux to abort the LabVIEW service.  I know that was already evidenced in my error logs above, but I'm only now learning more about the possible causes.

 

For the past couple of months I didn't realise why, in the dev environment, I had to randomly reset the connection (i.e. restart the LabVIEW service under the hood).  Contrary to the reports above, this is not just happening at startup for me: by intentionally stressing the system I am now able to get SIGSEGV segmentation fault errors on roughly every 10th or 20th toggling of some GPIO pins (while also writing to a file) - most likely from the GPIO R/W actions.

 

I am narrowing down the culprit, I hope, but it has been a learning experience to get there.  The more experienced among you will realise how unusual it is for a native LabVIEW program to generate a segmentation fault.  It is unlikely to be a bug of the programmer's unless they are calling an external library with incorrect addressing (not my situation).

I am not in the office again for another 10-12 hours (to physically press the push buttons a random number of times to eventually trigger the fault), but I am looking forward to trying out this "fix", which someone else stumbled across for the RPi 3 some 7 years ago!  Basically, disabling the Remote GPIO feature of the RPi was the fix.  I hope this is the problem, otherwise I must dig deeper.  I'll keep you posted.

Peter
Message 7 of 12

I found the main culprit causing the BSOD-equivalent crashes during execution!  (It was NOT solved by disabling the Remote GPIO setting on the RPi.)  The culprit: I was calling the following code (Open, Write GPIO, Close) twice in succession without any delay between the successive calls.

 

Peter_B_0-1704350930826.png

 

 

Inserting a delay of at least 15-20 ms between the calls solved the "BSOD" equivalent I was getting - the segmentation fault that was crashing the LabVIEW service.

 

Now you might be asking why I do an open and close for my GPIO writes (and I'm inclined to change it).  It was because I was doing multiple asynchronous writes in parallel.  I initially solved that problem by making the above an atomic operation, as I found I could not have multiple instances of the Open Local.vi in my code.  I could only have one instance, so I would otherwise have had to share the LINX resource reference globally amongst all my asynchronous modules - something I put aside at the time.  I suspect/hope that if I were to do that, I could then call the Digital Write.vi much more often than every 10-15 ms; in fact, perhaps I should be able to call it with no delay at all without causing this RPi BSOD!  I am still experiencing the crash a little at startup, per my original post, so I will soon go through my init routines and insert delays to confirm I can solve both problems, at which point the need for a watchdog will drop to a very low priority.

 

BTW, as an aside, it takes only 3 seconds to restart the LabVIEW service and have the startup.rtexe re-running - that is nice and fast.  It takes much longer to restart the RPi (20-ish seconds before the startup.rtexe is running), so if I ever need a watchdog I will do it at two levels: one to restart the LabVIEW service if it or my app stops, and another to restart the RPi if the first method somehow fails.

 

My next, and hopefully last, report back here will be to confirm I have solved the startup BSOD that currently happens about 30% of the time and, I suspect, has the same root cause.

Peter
Message 8 of 12

A mix of good and bad news.  The good news is that I had some GPIO writes (per my last reply) at startup causing the segmentation fault error, so I solved those.

 

The bad news is that the remaining startup error now seems to be outside of my control (I even deleted the startup.rtexe file and still get the error).  When I perform a cold reboot (or possibly just a restart), about 1 in every 3 times Linux is unable to mount various directories, reporting:

 

  1. "....E: 10mount: mount: /run/schroot/mount/lv/dev/pts: not mount point or bad option....."  or 
  2. "...E: 10mount: mount: /run/schroot/mount/lv/proc: not mount point or bad option....."  or 
  3. "...E: 10mount: mount: /run/schroot/mount/lv/sys: not mount point or bad option....."

(complete errors for the first two copied below at [1], [2])

 

I've checked those DIRs and they certainly exist when I look.

 

Does anyone have any suggestions as to the cause, or my next debugging steps?

 

 

[1]
pi@raspberrypi:~ $ sudo systemctl status labview.service
● labview.service - LabVIEW 2021 chroot run-time daemon
Loaded: loaded (/etc/systemd/system/labview.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2024-01-05 17:49:34 AEDT; 6min ago
Process: 481 ExecStartPre=/usr/sbin/schroot-lv-start.sh (code=exited, status=1/FAILURE)
Process: 591 ExecStopPost=/usr/bin/schroot --end-session -c lv (code=exited, status=0/SUCCESS)
CPU: 848ms

Jan 05 17:49:32 raspberrypi systemd[1]: Starting LabVIEW 2021 chroot run-time daemon...
Jan 05 17:49:33 raspberrypi schroot-lv-start.sh[494]: E: 10mount: mount: /run/schroot/mount/lv/dev/pts: not mount point or bad option.
Jan 05 17:49:33 raspberrypi schroot-lv-start.sh[484]: E: lv: Chroot setup failed: stage=setup-start
Jan 05 17:49:33 raspberrypi systemd[1]: labview.service: Control process exited, code=exited, status=1/FAILURE
Jan 05 17:49:34 raspberrypi systemd[1]: labview.service: Failed with result 'exit-code'.
Jan 05 17:49:34 raspberrypi systemd[1]: Failed to start LabVIEW 2021 chroot run-time daemon.
pi@raspberrypi:~ $

 

[2]
pi@raspberrypi:~ $ sudo systemctl status labview.service
● labview.service - LabVIEW 2021 chroot run-time daemon
Loaded: loaded (/etc/systemd/system/labview.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2024-01-05 18:28:21 AEDT; 1min 11s ago
Process: 474 ExecStartPre=/usr/sbin/schroot-lv-start.sh (code=exited, status=1/FAILURE)
Process: 568 ExecStopPost=/usr/bin/schroot --end-session -c lv (code=exited, status=0/SUCCESS)
CPU: 747ms

Jan 05 18:28:19 raspberrypi systemd[1]: Starting LabVIEW 2021 chroot run-time daemon...
Jan 05 18:28:20 raspberrypi schroot-lv-start.sh[493]: E: 10mount: mount: /run/schroot/mount/lv/proc: not mount point or bad option.
Jan 05 18:28:20 raspberrypi schroot-lv-start.sh[483]: E: lv: Chroot setup failed: stage=setup-start
Jan 05 18:28:20 raspberrypi systemd[1]: labview.service: Control process exited, code=exited, status=1/FAILURE
Jan 05 18:28:21 raspberrypi systemd[1]: labview.service: Failed with result 'exit-code'.
Jan 05 18:28:21 raspberrypi systemd[1]: Failed to start LabVIEW 2021 chroot run-time daemon.
pi@raspberrypi:~ $
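One avenue I plan to try (a sketch based on general schroot behaviour; I haven't confirmed this is the cause) is checking for stale schroot sessions left over from an unclean shutdown, since those can make the 10mount setup script fail in exactly this way:

# list any sessions schroot thinks are still active
sudo schroot --list --all-sessions

# see what is actually mounted under the chroot
findmnt -R /run/schroot/mount/lv

# end any stale sessions, then retry the service
sudo schroot --end-session --all-sessions
sudo systemctl restart labview.service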

Peter
Message 9 of 12
My latest thoughts include
 
1. reinstalling the lvrte (21.0.0-2) from the Target Configuration window, or
2. starting fresh from a new Raspbian image, or
3. doing a deeper dive into the error message (see below)
 
I'm using Bard to help me out.  It seems pretty capable, so I'm posting here what I've learned/tried.
Bard wrote:
 
"Observations:
Late Start: You're correct that labview.service starts relatively late in the boot sequence, after most other services.
Key Services Already Started: Essential services like systemd-random-seed.service, systemd-fsck-root.service, and systemd-tmpfiles-setup.service are already running before labview.service.
 
Implications:
Timing Issue Less Likely: The likelihood of a boot timing issue causing the mount error is indeed reduced based on this output.
Focus on Other Causes: It's worth prioritizing other potential causes, such as:
Configuration errors within schroot or the labview chroot.
Conflicts with other services or processes.
System-specific issues related to your hardware or software setup."
 
 
It seemed like a boot-related timing issue, i.e. the various directories under `/run/schroot/mount/lv/` were not available when labview.service started, so I then looked at which services had started, and when, at boot using
 
systemd-analyze blame
 
It seems labview.service is run after all but 2 other services, hence it seems unlikely that the service responsible for mounting the required directories has not already run before labview.service runs at boot.
 
 
7.243s hciuart.service
4.301s apt-daily.service
4.027s labview.service
2.741s dev-mmcblk0p2.device
2.480s cups.service
2.043s udisks2.service
1.879s raspi-config.service
1.717s NetworkManager.service
1.489s polkit.service
1.413s user@0.service
1.397s schroot.service
1.325s systemd-modules-load.service
1.314s ModemManager.service
1.303s avahi-daemon.service
1.005s systemd-journal-flush.service
 933ms systemd-logind.service
 931ms glamor-test.service
 892ms user@1000.service
 868ms keyboard-setup.service
 864ms wpa_supplicant.service
 861ms gldriver-test.service
 861ms rng-tools-debian.service
 704ms dphys-swapfile.service
 686ms systemd-timesyncd.service
 630ms networking.service
 607ms systemd-udev-trigger.service
 584ms ssh.service
 543ms systemd-udevd.service
 503ms colord.service
 498ms dev-mqueue.mount
 494ms systemd-fsck-root.service
 493ms run-rpc_pipefs.mount
 488ms modprobe@configfs.service
 487ms sys-kernel-debug.mount
 485ms packagekit.service
 480ms e2scrub_reap.service
 480ms sys-kernel-tracing.mount
 477ms rsyslog.service
 471ms modprobe@drm.service
 466ms fake-hwclock.service
 461ms kmod-static-nodes.service
 457ms modprobe@fuse.service
 444ms lightdm.service
 422ms plymouth-quit-wait.service
 417ms systemd-fsck@dev-disk-by\x2dpartuuid-16988a49\x2d01.service
 395ms systemd-tmpfiles-setup-dev.service
 339ms systemd-tmpfiles-clean.service
 296ms systemd-journald.service
 289ms rpi-eeprom-update.service
 212ms bluetooth.service
 206ms systemd-remount-fs.service
 200ms systemd-tmpfiles-setup.service
 185ms triggerhappy.service
 151ms systemd-update-utmp.service
 139ms rc-local.service
 133ms plymouth-read-write.service
 106ms systemd-random-seed.service
 104ms systemd-sysusers.service
 100ms ifupdown-pre.service
  93ms console-setup.service
  85ms user-runtime-dir@1000.service
  84ms systemd-sysctl.service
  79ms plymouth-start.service
  77ms bthelper@hci0.service
  73ms sys-fs-fuse-connections.mount
  66ms sys-kernel-config.mount
  62ms alsa-restore.service
  62ms user-runtime-dir@0.service
  54ms nfs-config.service
  52ms systemd-user-sessions.service
  51ms systemd-update-utmp-runlevel.service
  40ms boot.mount
  39ms systemd-rfkill.service
  31ms rtkit-daemon.service
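Since schroot.service appears in that list, one experiment I may still run (a hypothetical drop-in; I haven't verified it changes anything) is to explicitly order labview.service after it and after local filesystems are mounted:

sudo systemctl edit labview.service

[Unit]
# serialize behind the distro's schroot session handling and local mounts
After=schroot.service local-fs.target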
Peter
Message 10 of 12