coredump 调试记录 - PHP 篇

阅读数:779 2019 年 9 月 27 日 14:37

coredump调试记录 - PHP篇

从实际问题出发

Linux 生产环境一个 PHP 脚本发生了 coredump,一群攻城狮利用一系列工具 (strace 看系统调用、gdb 断点分析) 进行定位、分析。在构造出可以稳定复现问题的环境下,顺利解决该问题。

背景知识以及发展史

中心思想:coredump 和语言无关,本篇文章只是众多 coredump 场景之一,首先了解下 coredump 的前世今生。

名词解释

core:The name comes from magnetic core memory,[3] the principal form of random access memory from the 1950s to the 1970s. The name has remained long after magnetic core technology became obsolete.

在 20 世纪 50 年代至 70 年代,磁芯内存(Magnetic Core Memory)是电脑随机存取存储器 (即内存) 的主要形式,Core Dump 中的 Core 沿用了磁芯内存的 Core 表达。

dump:Earliest core dumps were paper printouts[6] of the contents of memory, typically arranged in columns of octal or hexadecimal numbers (a “hex dump”), sometimes accompanied by their interpretations as machine language instructions, text strings, or decimal or floating-point numbers (cf. disassembler).

As memory sizes increased and post-mortem analysis utilities were developed, dumps were written to magnetic media like tape or disk.

早期的 core dump 是打印在纸张上的内存内容,通常排列成八进制或者十六进制的数字,有时它伴随着机器语言的解释指令,字符串或者小数或者浮点型。

随着现代分析工具的诞生以及内存容量的增长,Dump 内容被写到了诸如磁带或者磁盘这类磁芯 (存储) 介质上。

dump 出来的内容是格式化的,可以使用一些工具来解析它。

strace:是一个集诊断、调试、统计与一体的工具,我们可以使用 strace 对应用的系统调用和信号传递的跟踪结果来对应用进行分析,以达到解决问题或者是了解应用工作过程的目的。当然 strace 与专业的调试工具比如说 gdb 之类的是没法相比的,因为它不是一个专业的调试器。

gdb:一个由 GNU 开源组织发布的、UNIX/LINUX 操作系统下的、基于命令行的、功能强大的程序调试工具。

对于一名 Linux 下工作的 c++ 程序员,gdb 是必不可少的工具,GDB 主要可帮助工程师完成下面 4 个方面的功能:

  • 启动程序,可以按照工程师自定义的要求随心所欲的运行程序。

  • 让被调试的程序在工程师指定的断点处停住,断点可以是条件表达式。

  • 当程序被停住时,可以检查此时程序中所发生的事,并追索上文。

  • 动态地改变程序的执行环境。

Linux:是一种自由和开放源码的类 UNIX 操作系统。

backtrace:支持应用程序进行自我调试 (self-debugging)。

摘要

In computing, a core dump, crash dump, memory dump, or system dump[1] consists of the recorded state of the working memory of a computer program at a specific time, generally when the program has crashed or otherwise terminated abnormally.[2] In practice, other key pieces of program state are usually dumped at the same time, including the processor registers, which may include the program counter and stack pointer, memory management information, and other processor and operating system flags and information. A snapshot dump (or snap dump) is a memory dump requested by the computer operator or by the running program, after which the program is able to continue. Core dumps are often used to assist in diagnosing and debugging errors in computer programs. ----Wikipedia.

释义:在计算机领域,核心转储、崩溃转储、内容转储或者系统转储,可以认为是,当程序崩溃或者异常终止时记录的计算机程序的工作内存状态信息。在实际场景下,其他关键的程序状态信息通常也会被转储下来,包括处理器寄存器信息,里面包含了程序计数器和栈指针、内存管理信息和其他处理器和操作系统层的各种标记及信息。

镜像转储是计算机操作员或者正在运行的程序请求的内存的转储,在此之后程序得以继续运行。核心转储通常被用来辅助诊断、调试程序错误。可以简单想象一台高清相机按下快门的瞬间 (只是信息会更丰富)。

发展史

内存中 dump 数据打印在纸张上进行分析 (内容相对较少) → 现代操作系统创建一种特殊格式的文件,进行格式化输出 (内容得到了极大的丰富,详见摘要部分)。目的只有一个:辅助攻城狮调试程序或者找到导致程序异常 / 崩溃的根本原因。

一、深入了解 coredump

1.1 原理分析

触发场景:应用程序崩溃或异常终止时,本小结讲说明在何种崩溃、何种异常终止情况下会产生 coredump

原理:coredump 是一种基于异步信号的内存信息捕获机制,可以触发 coredump 的常见的信号有 SIGBUS、SIGSEGV、SIGILL、SIGABRT、SIGQUIT 等。查找 man 手册可以获得更多的信息。coredump 机制提供用户态内存信息的镜像,方便用户事后 debug 程序。

The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored. 所以 SIGKILL 即 kill -9 是无法触发产生 coredump 的,详见下图。

coredump调试记录 - PHP篇

1.2 coredump 造成的影响

业务影响:从上述 coredump 产生的场景可得知,出 Core 意味着进程立即终止,服务不能正常响应,即服务受损,执行 coredump 的耗时与应用程序所占用内存资源多少正比,进程占用内存资源越多,coredump 耗时越长,例如当进程占用 60G+ 以上内存时,完整 Core 文件需要 15 分钟才能完全写到磁盘上

心智负担:现代攻城狮日常工作大多应用如 Java、PHP、Python 等高级编程语言,相比 C 这类底层语言,对于 coredump 日常接触较少,举个例子:C 语言工程师一个指针越界就会导致 coredump(详见下例),而类似 Java 这种高级语言通过其强大的异常机制将我们从中解救了出来,正是在这种背景下,很多攻城狮见到 core 文件后往往无从下手,更有甚者甚者闻 core 色变。

1.3 如何生成一个 core 文件

通过 SIGQUIT 即 "ctrl+"

复制代码
work@VM_28_112_centos:~/duan/coredump$sleep 10
^\Quit (core dumped)

gcore 工具

复制代码
work@VM_28_112_centos:~/duan/coredump$sleep 10 &
[3] 5299
work@VM_28_112_centos:~/duan/coredump$gcore 5299
0x00007fb8f0b31550 in __nanosleep_nocancel () from /lib64/libc.so.6
warning: target file /proc/5299/cmdline contained unexpected null characters
Saved corefile core.5299

代码异常

复制代码
#include<stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
/* SIGSEGV */
/* char buf[1]; */
/* buf[1]=1; */
/* strcpy(buf,"111111111"); */
/* printf("buf is %p:%s\n",buf,buf); */
/* return 0; */
/* ---------- */
/* SIGFPE */
int a=0;
int b=1;
int c=b/a;
return 0;
}

1.4 core 文件格式

复制代码
work@VM_28_112_centos:~/duan/coredump$readelf -h core.5299
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: CORE (Core file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 64 (bytes into file)
Start of section headers: 368160 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 15
Size of section headers: 64 (bytes)
Number of section headers: 17
Section header string table index: 16

1.5 coredump 设置

大小限制

前面提到 " 进程占用 60G+ 以上内存时,完整 Core 文件需要 15 分钟才能完全写到磁盘上 ",如果 core 文件大小不加以限制,存在机器磁盘被打满进而影响其他服务的风险 (从机器利用率角度出发服务混布的情况还是比较普遍的),大小限制是一柄双刃剑 ---- 加限制后 core 文件会被截断,影响后续分析,具体配置方式 ulimit -c ,如果设置为 0,则不会生成 core 文件

另外为了提升服务可用性,我们一般会用 Supervisor 来管理工作进程,它的作用是当一个工作进程意外被杀死,supervisort 监听到进程死后,会自动将它重新拉起,很方便的做到进程自动恢复的功能。在这种场景下一旦服务进程 coredump,Supervisor 就会将其 " 拉起 ",进而又会 coredump,出现恶性循环。

生成规则约束

提升可阅读性一般会把服务名以及进程 id 作为 core 文件的命名规则,形如:

复制代码
cat /proc/sys/kernel/core_pattern
/tmp/core-%e-%p

pipe 目前生产环境应用的不多,不做过多展开。

二、回到我们遇到的 case

2.1 具体场景

前面说了这么多,主要目的是让大家对 coredump 有更全面的了解,同时降低一些心智负担。接下来要上场的就是我们的实际场景。

背景:一个定时任务出现 coredump,之前就发生过,通过其他 hack 方式规避掉了问题,并未深究。恰逢近日又遇此场景,千载难逢,决定一探究竟。

确认配置信息:

复制代码
work@VM_28_112_centos:~/duan/coredump$cat /proc/sys/kernel/core_pattern
/tmp/core-%e-%p //core 文件放在了 /tmp 目录下,命名规则 core-{可执行文件名}-{进程 id}
work@VM_28_112_centos:~/duan/coredump$ulimit -a
core file size (blocks, -c) 29 // core 文件大小受限,已生成的 core 文件极有可能是被截断的 (亦即受损的,无法提供进行调试) 1blocks = 1024 字节
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127965
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 65535
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
work@VM_28_112_centos:~/duan/coredump$ll -tr /tmp/core-*
-rw------- 1 www www 53248 Aug 1 15:50 /tmp/core-php-26527
-rw------- 1 www www 53248 Aug 1 15:51 /tmp/core-php-27365
-rw------- 1 www www 53248 Aug 1 15:52 /tmp/core-php-27854
-rw------- 1 www www 53248 Aug 1 15:53 /tmp/core-php-28451
-rw------- 1 www www 53248 Aug 1 15:54 /tmp/core-php-29338
-rw------- 1 www www 53248 Aug 1 15:55 /tmp/core-php-29878
-rw------- 1 www www 53248 Aug 1 15:56 /tmp/core-php-30619
-rw------- 1 www www 53248 Aug 1 15:57 /tmp/core-php-31314
-rw------- 1 www www 53248 Aug 1 15:58 /tmp/core-php-31859

运行环境

  • centos7
  • php 7.0.13
复制代码
Web framework delivered as a C-extension for PHP
phalcon => enabled
Author => Phalcon Team and contributors
Version => 3.2.4
Build Date => Jan 29 2018 17:41:27
Powered by Zephir => Version 0.10.4-11e39849b0
Directive => Local Value => Master Value
phalcon.db.escape_identifiers => On => On
phalcon.db.force_casting => Off => Off
phalcon.orm.cast_on_hydrate => Off => Off
phalcon.orm.column_renaming => On => On
phalcon.orm.disable_assign_setters => Off => Off
phalcon.orm.enable_implicit_joins => On => On
phalcon.orm.enable_literals => On => On
phalcon.orm.events => On => On
phalcon.orm.exception_on_failed_save => Off => Off
phalcon.orm.ignore_unknown_columns => Off => Off
phalcon.orm.late_state_binding => Off => Off
phalcon.orm.not_null_validations => On => On
phalcon.orm.update_snapshot_on_save => On => On
phalcon.orm.virtual_foreign_keys => On => On
work@VM_28_112_centos:~/duan/coredump$php -v
PHP 7.0.13 (cli) (built: Dec 14 2016 11:29:15) ( NTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
with Zend OPcache v7.0.13, Copyright (c) 1999-2016, by Zend Technologies

三、一探究竟

3.1 工具集介绍

gdb:调试 core 文件利器,只介绍接单用法,gdb {exec_file} {corefile} ,调试时 exec_file 一定要和产生 coredump 时的可执行文件保持完全一致,gdb 更详细用法自行百度即可。

ulimit: 如前面所述,管理 coredump 文件大小。

strace :查看系统调用。

3.2 上手调试

注意前两个代码片段中 coredump 文件大小。

复制代码
work@VM_28_112_centos:~/duan/coredump$gdb /usr/local/matrix/bin/php /tmp/core-php-31859
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/matrix/bin/php...Reading symbols from /usr/local/matrix/bin/php...(no debugging symbols found)...done.
(no debugging symbols found)...done.
BFD: Warning: /tmp/core-php-31859 is truncated: expected core file size >= 11030528, found: 53248.
[New LWP 31859]
Failed to read a valid object file image from memory.
Core was generated by `/usr/local/matrix/bin/php /data0/www/htdocs/xxxx-yyyy/cli/c'.
Program terminated with signal 11, Segmentation fault.
*** #0 0x00007f2f1e2e430b in ?? () // 因为被截断所以堆栈信息不可访问 ***
Missing separate debuginfos, use: debuginfo-install matrix-php-7.0.13-1.x86_64
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7ffcdbea3260:
(gdb)
Core was generated by `/usr/local/matrix/bin/php /data0/www/htdocs/xxxx-yyyy/cli/c'.

既然 core 文件大小设定了限制,我们把限制打开就行了,ulimit -c unlimited。

复制代码
work@VM_28_112_centos:~/duan/coredump$ulimit -c unlimited
-bash: ulimit: core file size: cannot modify limit: Operation not permitted
work@VM_28_112_centos:~/duan/coredump$sudo ulimit -c unlimited
sudo: ulimit: command not found

sudo: ulimit: command not found 解决方案:sudo su 然后执行 ulimit 即可。因为 sudo 是要找到一个可运行的程序,而 ulimit 并不是一个独立的可运行的程序,如下:

复制代码
work@VM_28_112_centos:~/duan/coredump/phalcon$whereis cd
cd: /usr/bin/cd /usr/share/man/man1/cd.1.gz
work@VM_28_112_centos:~/duan/coredump/phalcon$whereis ulimit
ulimit: /usr/include/ulimit.h /usr/share/man/man1/ulimit.1.gz

解决方案:

  • sudo su && ulimit -c unlimited
  • sudo sh -c “ulimit -n 65535 && exec su $LOGNAME”

3.3 复现

复制代码
work@VM_28_112_centos:~/duan/coredump$/usr/local/matrix/bin/php /data0/www/htdocs/xxx-yyy/cli/cli.php history_deal_export_v2 main
Segmentation fault (core dumped)
work@VM_28_112_centos:~/duan/coredump$gdb /usr/local/matrix/bin/php -c /tmp/core-php-13396
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 [1/55207]
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/matrix/bin/php...Reading symbols from /usr/local/matrix/bin/php...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 13396]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/matrix/bin/php /data0/www/htdocs/preview-deal.fang.lianjia.com/cli/c'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fcc510e430b in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
Missing separate debuginfos, use: debuginfo-install matrix-php-7.0.13-1.x86_64
(gdb) bt
#0 0x00007fcc510e430b in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#1 0x00007fcc51239686 in zim_Phalcon_Di_get () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#2 0x00000000006c3fd4 in zend_call_function ()
#3 0x00007fcc510d77d1 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#4 0x00007fcc510d80ca in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#5 0x00007fcc511c4a4b in zim_Phalcon_Di_getShared () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#6 0x00000000006c3fd4 in zend_call_function ()
#7 0x00007fcc510d77d1 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#8 0x00007fcc510d80ca in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#9 0x00007fcc51395de5 in zim_Phalcon_Dispatcher_dispatch () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#10 0x00000000006c3fd4 in zend_call_function ()
#11 0x00007fcc510d77d1 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#12 0x00007fcc510d80ca in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#13 0x00007fcc511fe52d in zim_Phalcon_Cli_Console_handle () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so
#14 0x000000000074bd7b in ZEND_DO_FCALL_SPEC_HANDLER ()
#15 0x000000000070dffb in execute_ex ()
#16 0x0000000000757767 in zend_execute ()
#17 0x00000000006d1b64 in zend_execute_scripts ()
#18 0x0000000000676260 in php_execute_script ()
#19 0x00000000007592ef in do_cli ()
#20 0x00000000004332df in main ()

关键信息:

可以看到是在 phalcon 层面发生的 coredump

栈顶信息 zephir_create_instance.constprop.180
借助下 strace 工具 strace -s 2048 -T -tt -o /tmp/straceCore -v /usr/local/matrix/bin/php /data0/www/htdocs/someBiz/cli/cli.php history_deal_export_v2 main。

复制代码
...
20:38:09.832524 stat("/data0/www/htdocs/someBiz/app/library/BaseTask.php", {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000039>
20:38:09.832619 lstat("/data0/www/htdocs/someBiz/app/library/BaseTask.php", {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000011>
20:38:09.832671 open("/data0/www/htdocs/someBiz/app/library/BaseTask.php", O_RDONLY) = 5 <0.000013>
20:38:09.832708 fstat(5, {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000008>
20:38:09.832751 fstat(5, {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000011>
20:38:09.832812 fstat(5, {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000009>
20:38:09.832855 mmap(NULL, 1029, PROT_READ, MAP_SHARED, 5, 0) = 0x7fad52135000 <0.000011>
20:38:09.832949 munmap(0x7fad52135000, 1029) = 0 <0.000012>
20:38:09.832988 close(5) = 0 <0.000009>
20:38:09.833074 stat("/data0/www/htdocs/someBiz/app/common/ExportConst.php", {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000017>
20:38:09.833160 lstat("/data0/www/htdocs/someBiz/app/common/ExportConst.php", {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000011>
20:38:09.833206 lstat("/data0/www/htdocs/someBiz/app/common", {st_dev=makedev(253, 16), st_ino=8390886, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=2019/08/02-20:11:11.652054375, st_mtime=2019/07/31-15:48:31.044078262, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000010>
20:38:09.833253 open("/data0/www/htdocs/someBiz/app/common/ExportConst.php", O_RDONLY) = 5 <0.000013>
20:38:09.833290 fstat(5, {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000008>
20:38:09.833332 fstat(5, {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000008>
20:38:09.833374 fstat(5, {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000013>
20:38:09.833428 mmap(NULL, 31879, PROT_READ, MAP_SHARED, 5, 0) = 0x7fad5212e000 <0.000011>
20:38:09.833783 munmap(0x7fad5212e000, 31879) = 0 <0.000016>
20:38:09.833830 close(5) = 0 <0.000010>
20:38:09.833890 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x18} ---
20:38:09.849002 +++ killed by SIGSEGV (core dumped) +++

关键信息:

当程序访问到:/data0/www/htdocs/someBiz/app/common/ExportConst.php 后程序接到了 SIGSEGV 信号,进而导致 coredump。

3.4 结合业务代码进行分析

入口文件

复制代码
use Phalcon\Di\FactoryDefault\Cli as CliDI, Phalcon\Cli\Console as ConsoleApp;
define('ROOT_PATH', dirname(__DIR__));
define('APP_PATH', ROOT_PATH . DIRECTORY_SEPARATOR . 'app');
define("APP_MODE", get_cfg_var('lianjia.environment'));
define("APP_NAME", 'deal');
$di = new CliDI();
$config = include ROOT_PATH . '/config/' . APP_MODE . '.php';
include ROOT_PATH . '/config/loader.php';
include ROOT_PATH . '/config/cli_service.php';
$di->get('dispatcher')->setDefaultNamespace('someBiz\tasks');
// 设置 php 错误日志
$error_log = '/tmp/' . APP_NAME . '_php_error_' . date('Ymd') . '.log';
ini_set('error_log', $error_log);
\Framework\Context::init();
$console = new ConsoleApp();
$console->setDI($di);
/**
* 处理 console 应用参数
*/
$arguments = array();
foreach ($argv as $k => $arg) {
if ($k == 1) {
$arguments['task'] = $arg;
} elseif ($k == 2) {
$arguments['action'] = $arg;
} elseif ($k >= 3) {
$arguments['params'][] = $arg;
}
}
// 定义全局的参数, 设定当前任务及动作
define('CURRENT_TASK', (isset($argv[1]) ? $argv[1] : null));
define('CURRENT_ACTION', (isset($argv[2]) ? $argv[2] : null));
try {
// 处理参数
$console->handle($arguments); // 对应#13 0x00007fcc511fe52d in zim_Phalcon_Cli_Console_handle ()
} catch (\Exception $e) {
echo $e->getMessage();
exit(255);
}

结合堆栈信息分析
zephir_create_instance 函数从命名上能看出来是创建一个实例,函数定义详见:

复制代码
/**
* Creates a new instance dynamically. Call constructor without parameters
*/
int zephir_create_instance(zval *return_value, const zval *class_name TSRMLS_DC){
zend_class_entry *ce;
if (Z_TYPE_P(class_name) != IS_STRING) {
zephir_throw_exception_string(spl_ce_RuntimeException, SL("Invalid class name") TSRMLS_CC);
return FAILURE;
}
ce = zend_fetch_class(Z_STRVAL_P(class_name), Z_STRLEN_P(class_name), ZEND_FETCH_CLASS_DEFAULT TSRMLS_CC);
if (!ce) {
ZVAL_NULL(return_value);
return FAILURE;
}
object_init_ex(return_value, ce);
if (zephir_has_constructor_ce(ce)) {
return zephir_call_class_method_aparams(NULL, ce, zephir_fcall_method, return_value, SL("__construct"), NULL, 0, 0, NULL
TSRMLS_CC);
}
return SUCCESS;
}

参考堆栈信息,梳理调用链如下:

cphalcon/phalcon/Cli/Console.zep(handle 函数) → cphalcon/phalcon/Cli/Dispatcher.zep(执行父类 phalcon/Dispatcher/AbstractDispatcher.zep dispatch 函数) – call di getShared→ cphalcon/phalcon/Di.zep(getShared 函数) → cphalcon/phalcon/Di.zep(get 函数) – call→ cphalcon/phalcon/Di.zep(create_instalce) → ext/kernel/object.c (zephir_create_instance) 至此 coredump

zephir_create_instance.constprop 看起来像是一个常量成员,结合 strace 日志,第一次访问到 ExportConst.php 文件就是在该 class 的成员变量中,进一步排查。

大胆假设:

coredump调试记录 - PHP篇

小心求证:

coredump调试记录 - PHP篇coredump调试记录 - PHP篇

进一步推演:

假设是因为这个原因,那不只是 cli 模式下服务异常,cgi 服务也应该有问题才对,所以继续探索。

测试接口: http://develop.api.home.ke.com/decorateplat/alphatest/version?emergency=1 {host 10.26.15.36 develop.api.home.ke.com }

正常 response

复制代码
{
errno: 10004,
error: "params version is missing",
request_id: "2019080717544285115",
data: { },
cost: 26
}

进一步测试:

coredump调试记录 - PHP篇

异常 response:

复制代码
502 Bad Gateway
The proxy server received an invalid response from an upstream server. Sorry for the inconvenience.
Please report this message and include the following information to us.
Thank you very much!
URL:
http://develop.api.home.ke.com/decorateplat/alphatest/version?emergency=1
Server:
vm_15_36_centos
Date:
2019/08/07 17:56:45
Powered by Tengine

日志信息:

复制代码
➜ develop.api.home.ke.com tail -f /usr/local/matrix/var/log/php-fpm.log
[02-Aug-2019 21:02:42] WARNING: [pool api.home.ke.com] child 25508 exited on signal 11 (SIGSEGV - core dumped) after 879742.488115 seconds from start
[02-Aug-2019 21:02:42] NOTICE: [pool api.home.ke.com] child 22853 started
[02-Aug-2019 21:04:47] WARNING: [pool api.home.ke.com] child 18875 exited on signal 11 (SIGSEGV - core dumped) after 1330478.281358 seconds from start
[02-Aug-2019 21:04:47] NOTICE: [pool api.home.ke.com] child 23251 started
[07-Aug-2019 17:35:44] WARNING: [pool api.home.ke.com] child 18878 exited on signal 11 (SIGSEGV - core dumped) after 1749935.065729 seconds from start
[07-Aug-2019 17:35:44] NOTICE: [pool api.home.ke.com] child 6541 started
[07-Aug-2019 17:51:53] WARNING: [pool api.home.ke.com] child 18876 exited on signal 11 (SIGSEGV - core dumped) after 1750904.354166 seconds from start
[07-Aug-2019 17:51:53] NOTICE: [pool api.home.ke.com] child 8994 started
[07-Aug-2019 17:56:45] WARNING: [pool api.home.ke.com] child 6541 exited on signal 11 (SIGSEGV - core dumped) after 1261.287392 seconds from start
[07-Aug-2019 17:56:45] NOTICE: [pool api.home.ke.com] child 9814 started

结论以及解决方案:

至此问题应该分析清楚了,phalcon3.2.4 版本程序访问到未定义的常量。导致服务 coredump,在 cli 模式下进程异常退出,cgi 模式下服务响应 502,解决方案如下:

  • 典型的代码 bug,好好测试即可避免,PS:当且仅当成员变量取值为一个不存在的常量时,才会引发 coredump。除此之外均会导致 Fatal Error。小伙伴们感兴趣的话自行验证吧。

  • 目前只发现 3.2.4 版本存在该问题,而生产环境有两个版本共存,即 3.2.4 与 3.4.1,经验证 3.4.1 已修复该问题,故建议升级至 3.4.1 版本。

后记

为什么只有在一个类的成员变量访问到未定义常量时才会 coredump?,目标锁定在 phalcon 框架,既然我们已经定位到了是在 zephir_create_instance.constprop.180 函数阶段发生的异常 (详见 backtrace 栈顶信息)。

到源码上一探究竟,phalcon 3.2.4 ext/kernel/object.c,源码上并没有 zephir_create_instance.constprop.180,继续对 phalcon 框架、zephir 深入研究熟悉,里面用到了 LLVM 相关知识点。

待下篇文章分析 b zend_fetch_class coredump 之前最后一次断点。

复制代码
Breakpoint 3, 0x00000000006c4db0 in zend_fetch_class ()
(gdb)
Single stepping until exit from function zend_fetch_class,
which has no line number information.
0x00007fffe4029273 in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
(gdb) bt
#0 0x00007fffe4029273 in zephir_create_instance.constprop.180 ()
from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#1 0x00007fffe4250860 in zim_Phalcon_Di_Service_resolve () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#2 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff4414650, return_value=0x7fffffff9cf0)
at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026
#3 0x00000000006c4388 in zend_call_function ()
#4 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#5 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()
from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#6 0x00007fffe417eac3 in zim_Phalcon_Di_get () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#7 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff44145d0, return_value=0x7fffffffa2c0)
at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026
#8 0x00000000006c4388 in zend_call_function ()
#9 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#10 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()
from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#11 0x00007fffe410996b in zim_Phalcon_Di_getShared () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#12 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff4414560, return_value=0x7fffffffa8e0)
at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026
#13 0x00000000006c4388 in zend_call_function ()
#14 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#15 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()
from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#16 0x00007fffe4142d4a in zim_Phalcon_Cli_Console_handle () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#17 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff44144f0, return_value=0x7ffff44144d0)
at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026
#18 0x000000000074bc14 in ZEND_DO_FCALL_SPEC_HANDLER ()
#19 0x000000000070dffb in execute_ex ()
#20 0x00007fffeda6d079 in xdebug_execute_ex (execute_data=0x7ffff4414030) at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:1904
#21 0x0000000000757767 in zend_execute ()
#22 0x00000000006d1b64 in zend_execute_scripts ()
---Type <return> to continue, or q <return> to quit---
#23 0x0000000000676260 in php_execute_script ()
#24 0x00000000007592ef in do_cli ()
#25 0x00000000004332df in main ()
(gdb) c
Continuing.
Breakpoint 3, 0x00000000006c4db0 in zend_fetch_class ()
(gdb)
Continuing.
Breakpoint 3, 0x00000000006c4db0 in zend_fetch_class ()
(gdb)
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00007fffe402928b in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
(gdb) bt
#0 0x00007fffe402928b in zephir_create_instance.constprop.180 ()
from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#1 0x00007fffe417e5a6 in zim_Phalcon_Di_get () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#2 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff4414630, return_value=0x7fffffff9650)
at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026
#3 0x00000000006c4388 in zend_call_function ()
#4 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#5 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()
from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#6 0x00007fffe410996b in zim_Phalcon_Di_getShared () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#7 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff44145c0, return_value=0x7fffffff9d20)
at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026
#8 0x00000000006c4388 in zend_call_function ()
#9 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#10 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()
from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#11 0x00007fffe42d9fd5 in zim_Phalcon_Dispatcher_dispatch () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#12 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff4414560, return_value=0x7fffffffa8c0)
at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026
#13 0x00000000006c4388 in zend_call_function ()
#14 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
#15 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()
from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so
.....

作者介绍:
扶墙(企业代号名),目前主要负责贝壳找房新装修业务线客源方向研发以及中心整体技术架构设计工作。

本文转载自公众号贝壳产品技术(ID:gh_9afeb423f390)。

原文链接:

https://mp.weixin.qq.com/s/-VdusyL3d7vpZvNMDMQe_g

评论

发布