【ArchSummit架构师峰会】探讨数据与人工智能相互驱动的关系>>> 了解详情
写点什么

coredump 调试记录 - PHP 篇

  • 2019-09-27
  • 本文字数:16531 字

    阅读完需:约 54 分钟

coredump调试记录 - PHP篇

从实际问题出发

Linux 生产环境一个 PHP 脚本发生了 coredump,一群攻城狮利用一系列工具(strace 看系统调用、gdb 断点分析)进行定位、分析。在构造出可以稳定复现问题的环境下,顺利解决该问题。

背景知识以及发展史

中心思想:coredump 和语言无关,本篇文章只是众多 coredump 场景之一,首先了解下 coredump 的前世今生。

名词解释

core:The name comes from magnetic core memory,[3] the principal form of random access memory from the 1950s to the 1970s. The name has remained long after magnetic core technology became obsolete.


在 20 世纪 50 年代至 70 年代,磁芯内存(Magnetic Core Memory)是电脑随机存取存储器(即内存)的主要形式,Core Dump 中的 Core 沿用了磁芯内存的 Core 表达。


dump:Earliest core dumps were paper printouts[6] of the contents of memory, typically arranged in columns of octal or hexadecimal numbers (a “hex dump”), sometimes accompanied by their interpretations as machine language instructions, text strings, or decimal or floating-point numbers (cf. disassembler).


As memory sizes increased and post-mortem analysis utilities were developed, dumps were written to magnetic media like tape or disk.


早期的 core dump 是打印在纸张上的内存内容,通常排列成八进制或者十六进制的数字,有时它伴随着机器语言的解释指令,字符串或者小数或者浮点型。


随着现代分析工具的诞生以及内存容量的增长,Dump 内容被写到了诸如磁带或者磁盘这类磁芯(存储)介质上。


dump 出来的内容是格式化的,可以使用一些工具来解析它。


strace:是一个集诊断、调试、统计与一体的工具,我们可以使用 strace 对应用的系统调用和信号传递的跟踪结果来对应用进行分析,以达到解决问题或者是了解应用工作过程的目的。当然 strace 与专业的调试工具比如说 gdb 之类的是没法相比的,因为它不是一个专业的调试器。


gdb:一个由 GNU 开源组织发布的、UNIX/LINUX 操作系统下的、基于命令行的、功能强大的程序调试工具。


对于一名 Linux 下工作的 c++程序员,gdb 是必不可少的工具,GDB 主要可帮助工程师完成下面 4 个方面的功能:


  • 启动程序,可以按照工程师自定义的要求随心所欲的运行程序。

  • 让被调试的程序在工程师指定的断点处停住,断点可以是条件表达式。

  • 当程序被停住时,可以检查此时程序中所发生的事,并追索上文。

  • 动态地改变程序的执行环境。


Linux:是一种自由和开放源码的类 UNIX 操作系统。


backtrace:支持应用程序进行自我调试(self-debugging)。

摘要

In computing, a core dump, crash dump, memory dump, or system dump[1] consists of the recorded state of the working memory of a computer program at a specific time, generally when the program has crashed or otherwise terminated abnormally.[2] In practice, other key pieces of program state are usually dumped at the same time, including the processor registers, which may include the program counter and stack pointer, memory management information, and other processor and operating system flags and information. A snapshot dump (or snap dump) is a memory dump requested by the computer operator or by the running program, after which the program is able to continue. Core dumps are often used to assist in diagnosing and debugging errors in computer programs. ----Wikipedia.


释义:在计算机领域,核心转储、崩溃转储、内容转储或者系统转储,可以认为是,当程序崩溃或者异常终止时记录的计算机程序的工作内存状态信息。在实际场景下,其他关键的程序状态信息通常也会被转储下来,包括处理器寄存器信息,里面包含了程序计数器和栈指针、内存管理信息和其他处理器和操作系统层的各种标记及信息。


镜像转储是计算机操作员或者正在运行的程序请求的内存的转储,在此之后程序得以继续运行。核心转储通常被用来辅助诊断、调试程序错误。可以简单想象一台高清相机按下快门的瞬间(只是信息会更丰富)。

发展史

内存中 dump 数据打印在纸张上进行分析(内容相对较少) → 现代操作系统创建一种特殊格式的文件,进行格式化输出(内容得到了极大的丰富,详见摘要部分)。目的只有一个:辅助攻城狮调试程序或者找到导致程序异常/崩溃的根本原因。

一、深入了解 coredump

1.1 原理分析

触发场景:应用程序崩溃或异常终止时,本小结讲说明在何种崩溃、何种异常终止情况下会产生 coredump


原理:coredump 是一种基于异步信号的内存信息捕获机制,可以触发 coredump 的常见的信号有 SIGBUS、SIGSEGV、SIGILL、SIGABRT、SIGQUIT 等。查找 man 手册可以获得更多的信息。coredump 机制提供用户态内存信息的镜像,方便用户事后 debug 程序。


The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.所以 SIGKILL 即 kill -9 是无法触发产生 coredump 的,详见下图。


1.2 coredump 造成的影响

业务影响:从上述 coredump 产生的场景可得知,出 Core 意味着进程立即终止,服务不能正常响应,即服务受损,执行 coredump 的耗时与应用程序所占用内存资源多少正比,进程占用内存资源越多,coredump 耗时越长,例如当进程占用 60G+以上内存时,完整 Core 文件需要 15 分钟才能完全写到磁盘上


心智负担:现代攻城狮日常工作大多应用如 Java、PHP、Python 等高级编程语言,相比 C 这类底层语言,对于 coredump 日常接触较少,举个例子:C 语言工程师一个指针越界就会导致 coredump(详见下例),而类似 Java 这种高级语言通过其强大的异常机制将我们从中解救了出来,正是在这种背景下,很多攻城狮见到 core 文件后往往无从下手,更有甚者甚者闻 core 色变。

1.3 如何生成一个 core 文件

通过 SIGQUIT 即"ctrl+"


work@VM_28_112_centos:~/duan/coredump$sleep 10^\Quit (core dumped)
复制代码


gcore 工具


work@VM_28_112_centos:~/duan/coredump$sleep 10 &                        [3] 5299work@VM_28_112_centos:~/duan/coredump$gcore 52990x00007fb8f0b31550 in __nanosleep_nocancel () from /lib64/libc.so.6warning: target file /proc/5299/cmdline contained unexpected null charactersSaved corefile core.5299
复制代码


代码异常


#include<stdio.h>                           #include <stdlib.h>                         #include <string.h>                         int main()                                  {                                               /* SIGSEGV */                               /* char buf[1]; */                          /* buf[1]=1; */                             /* strcpy(buf,"111111111"); */              /* printf("buf is %p:%s\n",buf,buf); */     /* return 0; */                             /* ---------- */                            /* SIGFPE */                                int a=0;                                    int b=1;                                    int c=b/a;                                  return 0;                               }
复制代码

1.4 core 文件格式

work@VM_28_112_centos:~/duan/coredump$readelf -h core.5299ELF Header:  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00  Class:                             ELF64  Data:                              2's complement, little endian  Version:                           1 (current)  OS/ABI:                            UNIX - System V  ABI Version:                       0  Type:                              CORE (Core file)  Machine:                           Advanced Micro Devices X86-64  Version:                           0x1  Entry point address:               0x0  Start of program headers:          64 (bytes into file)  Start of section headers:          368160 (bytes into file)  Flags:                             0x0  Size of this header:               64 (bytes)  Size of program headers:           56 (bytes)  Number of program headers:         15  Size of section headers:           64 (bytes)  Number of section headers:         17  Section header string table index: 16
复制代码

1.5 coredump 设置

大小限制


前面提到"进程占用 60G+以上内存时,完整 Core 文件需要 15 分钟才能完全写到磁盘上",如果 core 文件大小不加以限制,存在机器磁盘被打满进而影响其他服务的风险(从机器利用率角度出发服务混布的情况还是比较普遍的),大小限制是一柄双刃剑 ---- 加限制后 core 文件会被截断,影响后续分析,具体配置方式 ulimit -c ,如果设置为 0,则不会生成 core 文件


另外为了提升服务可用性,我们一般会用 Supervisor 来管理工作进程,它的作用是当一个工作进程意外被杀死,supervisort 监听到进程死后,会自动将它重新拉起,很方便的做到进程自动恢复的功能。在这种场景下一旦服务进程 coredump,Supervisor 就会将其"拉起",进而又会 coredump,出现恶性循环。


生成规则约束


提升可阅读性一般会把服务名以及进程 id 作为 core 文件的命名规则,形如:


cat /proc/sys/kernel/core_pattern
/tmp/core-%e-%p
复制代码


pipe 目前生产环境应用的不多,不做过多展开。

二、回到我们遇到的 case

2.1 具体场景

前面说了这么多,主要目的是让大家对 coredump 有更全面的了解,同时降低一些心智负担。接下来要上场的就是我们的实际场景。


背景:一个定时任务出现 coredump,之前就发生过,通过其他 hack 方式规避掉了问题,并未深究。恰逢近日又遇此场景,千载难逢,决定一探究竟。


确认配置信息:


work@VM_28_112_centos:~/duan/coredump$cat /proc/sys/kernel/core_pattern/tmp/core-%e-%p //core文件放在了/tmp目录下,命名规则core-{可执行文件名}-{进程id}work@VM_28_112_centos:~/duan/coredump$ulimit -acore file size          (blocks, -c) 29 // core文件大小受限,已生成的core文件极有可能是被截断的(亦即受损的,无法提供进行调试) 1blocks = 1024字节data seg size           (kbytes, -d) unlimitedscheduling priority             (-e) 0file size               (blocks, -f) unlimitedpending signals                 (-i) 127965max locked memory       (kbytes, -l) 64max memory size         (kbytes, -m) unlimitedopen files                      (-n) 65535pipe size            (512 bytes, -p) 8POSIX message queues     (bytes, -q) 819200real-time priority              (-r) 0stack size              (kbytes, -s) 8192cpu time               (seconds, -t) unlimitedmax user processes              (-u) 65535virtual memory          (kbytes, -v) unlimitedfile locks                      (-x) unlimitedwork@VM_28_112_centos:~/duan/coredump$ll -tr /tmp/core-*-rw------- 1 www         www            53248 Aug  1 15:50 /tmp/core-php-26527-rw------- 1 www         www            53248 Aug  1 15:51 /tmp/core-php-27365-rw------- 1 www         www            53248 Aug  1 15:52 /tmp/core-php-27854-rw------- 1 www         www            53248 Aug  1 15:53 /tmp/core-php-28451-rw------- 1 www         www            53248 Aug  1 15:54 /tmp/core-php-29338-rw------- 1 www         www            53248 Aug  1 15:55 /tmp/core-php-29878-rw------- 1 www         www            53248 Aug  1 15:56 /tmp/core-php-30619-rw------- 1 www         www            53248 Aug  1 15:57 /tmp/core-php-31314-rw------- 1 www         www            53248 Aug  1 15:58 /tmp/core-php-31859
复制代码


运行环境


  • centos7

  • php 7.0.13


Web framework delivered as a C-extension for PHPphalcon => enabledAuthor => Phalcon Team and contributorsVersion => 3.2.4Build Date => Jan 29 2018 17:41:27Powered by Zephir => Version 0.10.4-11e39849b0
Directive => Local Value => Master Valuephalcon.db.escape_identifiers => On => Onphalcon.db.force_casting => Off => Offphalcon.orm.cast_on_hydrate => Off => Offphalcon.orm.column_renaming => On => Onphalcon.orm.disable_assign_setters => Off => Offphalcon.orm.enable_implicit_joins => On => Onphalcon.orm.enable_literals => On => Onphalcon.orm.events => On => Onphalcon.orm.exception_on_failed_save => Off => Offphalcon.orm.ignore_unknown_columns => Off => Offphalcon.orm.late_state_binding => Off => Offphalcon.orm.not_null_validations => On => Onphalcon.orm.update_snapshot_on_save => On => Onphalcon.orm.virtual_foreign_keys => On => Onwork@VM_28_112_centos:~/duan/coredump$php -vPHP 7.0.13 (cli) (built: Dec 14 2016 11:29:15) ( NTS )Copyright (c) 1997-2016 The PHP GroupZend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies with Zend OPcache v7.0.13, Copyright (c) 1999-2016, by Zend Technologies
复制代码

三、一探究竟

3.1 工具集介绍

gdb:调试 core 文件利器,只介绍接单用法,gdb {exec_file} {corefile} ,调试时 exec_file 一定要和产生 coredump 时的可执行文件保持完全一致,gdb 更详细用法自行百度即可。


ulimit: 如前面所述,管理 coredump 文件大小。


strace :查看系统调用。

3.2 上手调试

注意前两个代码片段中 coredump 文件大小。


work@VM_28_112_centos:~/duan/coredump$gdb /usr/local/matrix/bin/php /tmp/core-php-31859GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7Copyright (C) 2013 Free Software Foundation, Inc.License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>This is free software: you are free to change and redistribute it.There is NO WARRANTY, to the extent permitted by law.  Type "show copying"and "show warranty" for details.This GDB was configured as "x86_64-redhat-linux-gnu".For bug reporting instructions, please see:<http://www.gnu.org/software/gdb/bugs/>...Reading symbols from /usr/local/matrix/bin/php...Reading symbols from /usr/local/matrix/bin/php...(no debugging symbols found)...done.(no debugging symbols found)...done.BFD: Warning: /tmp/core-php-31859 is truncated: expected core file size >= 11030528, found: 53248.[New LWP 31859]Failed to read a valid object file image from memory.Core was generated by `/usr/local/matrix/bin/php /data0/www/htdocs/xxxx-yyyy/cli/c'.Program terminated with signal 11, Segmentation fault.*** #0  0x00007f2f1e2e430b in ?? ()  //因为被截断所以堆栈信息不可访问 ***Missing separate debuginfos, use: debuginfo-install matrix-php-7.0.13-1.x86_64(gdb) btPython Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7ffcdbea3260:(gdb)
Core was generated by `/usr/local/matrix/bin/php /data0/www/htdocs/xxxx-yyyy/cli/c'.
复制代码


既然 core 文件大小设定了限制,我们把限制打开就行了,ulimit -c unlimited。


work@VM_28_112_centos:~/duan/coredump$ulimit -c unlimited-bash: ulimit: core file size: cannot modify limit: Operation not permittedwork@VM_28_112_centos:~/duan/coredump$sudo ulimit -c unlimitedsudo: ulimit: command not found
复制代码


sudo: ulimit: command not found 解决方案:sudo su 然后执行 ulimit 即可。因为 sudo 是要找到一个可运行的程序,而 ulimit 并不是一个独立的可运行的程序,如下:


work@VM_28_112_centos:~/duan/coredump/phalcon$whereis cdcd: /usr/bin/cd /usr/share/man/man1/cd.1.gzwork@VM_28_112_centos:~/duan/coredump/phalcon$whereis ulimitulimit: /usr/include/ulimit.h /usr/share/man/man1/ulimit.1.gz
复制代码


解决方案:


  • sudo su && ulimit -c unlimited

  • sudo sh -c “ulimit -n 65535 && exec su $LOGNAME”

3.3 复现

work@VM_28_112_centos:~/duan/coredump$/usr/local/matrix/bin/php /data0/www/htdocs/xxx-yyy/cli/cli.php  history_deal_export_v2 mainSegmentation fault (core dumped)work@VM_28_112_centos:~/duan/coredump$gdb /usr/local/matrix/bin/php -c /tmp/core-php-13396GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7                                                                                                                               [1/55207]Copyright (C) 2013 Free Software Foundation, Inc.License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>This is free software: you are free to change and redistribute it.There is NO WARRANTY, to the extent permitted by law.  Type "show copying"and "show warranty" for details.This GDB was configured as "x86_64-redhat-linux-gnu".For bug reporting instructions, please see:<http://www.gnu.org/software/gdb/bugs/>...Reading symbols from /usr/local/matrix/bin/php...Reading symbols from /usr/local/matrix/bin/php...(no debugging symbols found)...done.(no debugging symbols found)...done.[New LWP 13396][Thread debugging using libthread_db enabled]Using host libthread_db library "/lib64/libthread_db.so.1".Core was generated by `/usr/local/matrix/bin/php /data0/www/htdocs/preview-deal.fang.lianjia.com/cli/c'.Program terminated with signal 11, Segmentation fault.#0  0x00007fcc510e430b in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.soMissing separate debuginfos, use: debuginfo-install matrix-php-7.0.13-1.x86_64(gdb) bt#0  0x00007fcc510e430b in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#1  0x00007fcc51239686 in zim_Phalcon_Di_get () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#2  0x00000000006c3fd4 in zend_call_function ()#3  0x00007fcc510d77d1 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#4  0x00007fcc510d80ca in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#5  0x00007fcc511c4a4b in zim_Phalcon_Di_getShared () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#6  0x00000000006c3fd4 in zend_call_function ()#7  0x00007fcc510d77d1 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#8  0x00007fcc510d80ca in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#9  0x00007fcc51395de5 in zim_Phalcon_Dispatcher_dispatch () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#10 0x00000000006c3fd4 in zend_call_function ()#11 0x00007fcc510d77d1 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#12 0x00007fcc510d80ca in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#13 0x00007fcc511fe52d in zim_Phalcon_Cli_Console_handle () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20121212/phalcon.so#14 0x000000000074bd7b in ZEND_DO_FCALL_SPEC_HANDLER ()#15 0x000000000070dffb in execute_ex ()#16 0x0000000000757767 in zend_execute ()#17 0x00000000006d1b64 in zend_execute_scripts ()#18 0x0000000000676260 in php_execute_script ()#19 0x00000000007592ef in do_cli ()#20 0x00000000004332df in main ()
复制代码


关键信息:


可以看到是在 phalcon 层面发生的 coredump


栈顶信息 zephir_create_instance.constprop.180


借助下 strace 工具 strace -s 2048 -T -tt -o /tmp/straceCore -v /usr/local/matrix/bin/php /data0/www/htdocs/someBiz/cli/cli.php history_deal_export_v2 main。


...20:38:09.832524 stat("/data0/www/htdocs/someBiz/app/library/BaseTask.php", {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000039>20:38:09.832619 lstat("/data0/www/htdocs/someBiz/app/library/BaseTask.php", {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000011>20:38:09.832671 open("/data0/www/htdocs/someBiz/app/library/BaseTask.php", O_RDONLY) = 5 <0.000013>20:38:09.832708 fstat(5, {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000008>20:38:09.832751 fstat(5, {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000011>20:38:09.832812 fstat(5, {st_dev=makedev(253, 16), st_ino=8655576, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=1029, st_atime=2019/08/02-11:11:02.613447598, st_mtime=2019/01/24-18:50:06.896831669, st_ctime=2019/01/24-18:50:07.760833714}) = 0 <0.000009>20:38:09.832855 mmap(NULL, 1029, PROT_READ, MAP_SHARED, 5, 0) = 0x7fad52135000 <0.000011>20:38:09.832949 munmap(0x7fad52135000, 1029) = 0 <0.000012>20:38:09.832988 close(5)                = 0 <0.000009>20:38:09.833074 stat("/data0/www/htdocs/someBiz/app/common/ExportConst.php", {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000017>20:38:09.833160 lstat("/data0/www/htdocs/someBiz/app/common/ExportConst.php", {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000011>20:38:09.833206 lstat("/data0/www/htdocs/someBiz/app/common", {st_dev=makedev(253, 16), st_ino=8390886, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=2019/08/02-20:11:11.652054375, st_mtime=2019/07/31-15:48:31.044078262, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000010>20:38:09.833253 open("/data0/www/htdocs/someBiz/app/common/ExportConst.php", O_RDONLY) = 5 <0.000013>20:38:09.833290 fstat(5, {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000008>20:38:09.833332 fstat(5, {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000008>20:38:09.833374 fstat(5, {st_dev=makedev(253, 16), st_ino=8391700, st_mode=S_IFREG|0755, st_nlink=1, st_uid=99, st_gid=99, st_blksize=4096, st_blocks=64, st_size=31879, st_atime=2019/08/02-15:49:02.358751551, st_mtime=2019/07/31-15:48:29.901078099, st_ctime=2019/07/31-15:48:31.044078262}) = 0 <0.000013>20:38:09.833428 mmap(NULL, 31879, PROT_READ, MAP_SHARED, 5, 0) = 0x7fad5212e000 <0.000011>20:38:09.833783 munmap(0x7fad5212e000, 31879) = 0 <0.000016>20:38:09.833830 close(5)                = 0 <0.000010>20:38:09.833890 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x18} ---20:38:09.849002 +++ killed by SIGSEGV (core dumped) +++
复制代码


关键信息:


当程序访问到:/data0/www/htdocs/someBiz/app/common/ExportConst.php 后程序接到了 SIGSEGV 信号,进而导致 coredump。

3.4 结合业务代码进行分析

入口文件


use Phalcon\Di\FactoryDefault\Cli as CliDI, Phalcon\Cli\Console as ConsoleApp;
define('ROOT_PATH', dirname(__DIR__));define('APP_PATH', ROOT_PATH . DIRECTORY_SEPARATOR . 'app');define("APP_MODE", get_cfg_var('lianjia.environment'));define("APP_NAME", 'deal');
$di = new CliDI();
$config = include ROOT_PATH . '/config/' . APP_MODE . '.php';include ROOT_PATH . '/config/loader.php';include ROOT_PATH . '/config/cli_service.php';$di->get('dispatcher')->setDefaultNamespace('someBiz\tasks');
// 设置php错误日志$error_log = '/tmp/' . APP_NAME . '_php_error_' . date('Ymd') . '.log';ini_set('error_log', $error_log);
\Framework\Context::init();
$console = new ConsoleApp();$console->setDI($di);
/** * 处理console应用参数 */$arguments = array();foreach ($argv as $k => $arg) { if ($k == 1) { $arguments['task'] = $arg; } elseif ($k == 2) { $arguments['action'] = $arg; } elseif ($k >= 3) { $arguments['params'][] = $arg; }}
// 定义全局的参数, 设定当前任务及动作define('CURRENT_TASK', (isset($argv[1]) ? $argv[1] : null));define('CURRENT_ACTION', (isset($argv[2]) ? $argv[2] : null));
try { // 处理参数 $console->handle($arguments); //对应#13 0x00007fcc511fe52d in zim_Phalcon_Cli_Console_handle ()} catch (\Exception $e) { echo $e->getMessage(); exit(255);}
复制代码


结合堆栈信息分析


zephir_create_instance 函数从命名上能看出来是创建一个实例,函数定义详见:


/** * Creates a new instance dynamically. Call constructor without parameters */int zephir_create_instance(zval *return_value, const zval *class_name TSRMLS_DC){
zend_class_entry *ce;
if (Z_TYPE_P(class_name) != IS_STRING) { zephir_throw_exception_string(spl_ce_RuntimeException, SL("Invalid class name") TSRMLS_CC); return FAILURE; }
ce = zend_fetch_class(Z_STRVAL_P(class_name), Z_STRLEN_P(class_name), ZEND_FETCH_CLASS_DEFAULT TSRMLS_CC); if (!ce) { ZVAL_NULL(return_value); return FAILURE; }
object_init_ex(return_value, ce); if (zephir_has_constructor_ce(ce)) { return zephir_call_class_method_aparams(NULL, ce, zephir_fcall_method, return_value, SL("__construct"), NULL, 0, 0, NULL TSRMLS_CC); }
return SUCCESS;}
复制代码


参考堆栈信息,梳理调用链如下:


cphalcon/phalcon/Cli/Console.zep(handle 函数) → cphalcon/phalcon/Cli/Dispatcher.zep(执行父类 phalcon/Dispatcher/AbstractDispatcher.zep dispatch 函数) – call di getShared→ cphalcon/phalcon/Di.zep(getShared 函数) → cphalcon/phalcon/Di.zep(get 函数) – call→ cphalcon/phalcon/Di.zep(create_instalce) → ext/kernel/object.c (zephir_create_instance) 至此 coredump


zephir_create_instance.constprop 看起来像是一个常量成员,结合 strace 日志,第一次访问到 ExportConst.php 文件就是在该 class 的成员变量中,进一步排查。


大胆假设:



小心求证:




进一步推演:


假设是因为这个原因,那不只是 cli 模式下服务异常,cgi 服务也应该有问题才对,所以继续探索。


测试接口:http://develop.api.home.ke.com/decorateplat/alphatest/version?emergency=1 {host 10.26.15.36 develop.api.home.ke.com}


正常 response


{errno: 10004,error: "params version is missing",request_id: "2019080717544285115",data: { },cost: 26}
复制代码


进一步测试:



异常 response:


502 Bad GatewayThe proxy server received an invalid response from an upstream server. Sorry for the inconvenience.Please report this message and include the following information to us.Thank you very much!URL:http://develop.api.home.ke.com/decorateplat/alphatest/version?emergency=1
Server:vm_15_36_centos
Date:2019/08/07 17:56:45

Powered by Tengine
复制代码


日志信息:


➜  develop.api.home.ke.com tail -f /usr/local/matrix/var/log/php-fpm.log[02-Aug-2019 21:02:42] WARNING: [pool api.home.ke.com] child 25508 exited on signal 11 (SIGSEGV - core dumped) after 879742.488115 seconds from start[02-Aug-2019 21:02:42] NOTICE: [pool api.home.ke.com] child 22853 started[02-Aug-2019 21:04:47] WARNING: [pool api.home.ke.com] child 18875 exited on signal 11 (SIGSEGV - core dumped) after 1330478.281358 seconds from start[02-Aug-2019 21:04:47] NOTICE: [pool api.home.ke.com] child 23251 started[07-Aug-2019 17:35:44] WARNING: [pool api.home.ke.com] child 18878 exited on signal 11 (SIGSEGV - core dumped) after 1749935.065729 seconds from start[07-Aug-2019 17:35:44] NOTICE: [pool api.home.ke.com] child 6541 started[07-Aug-2019 17:51:53] WARNING: [pool api.home.ke.com] child 18876 exited on signal 11 (SIGSEGV - core dumped) after 1750904.354166 seconds from start[07-Aug-2019 17:51:53] NOTICE: [pool api.home.ke.com] child 8994 started[07-Aug-2019 17:56:45] WARNING: [pool api.home.ke.com] child 6541 exited on signal 11 (SIGSEGV - core dumped) after 1261.287392 seconds from start[07-Aug-2019 17:56:45] NOTICE: [pool api.home.ke.com] child 9814 started
复制代码


结论以及解决方案:


至此问题应该分析清楚了,phalcon3.2.4 版本程序访问到未定义的常量。导致服务 coredump,在 cli 模式下进程异常退出,cgi 模式下服务响应 502,解决方案如下:


  • 典型的代码 bug,好好测试即可避免,PS:当且仅当成员变量取值为一个不存在的常量时,才会引发 coredump。除此之外均会导致 Fatal Error。小伙伴们感兴趣的话自行验证吧。

  • 目前只发现 3.2.4 版本存在该问题,而生产环境有两个版本共存,即 3.2.4 与 3.4.1,经验证 3.4.1 已修复该问题,故建议升级至 3.4.1 版本。

后记

为什么只有在一个类的成员变量访问到未定义常量时才会 coredump?,目标锁定在 phalcon 框架,既然我们已经定位到了是在 zephir_create_instance.constprop.180 函数阶段发生的异常(详见 backtrace 栈顶信息)。


到源码上一探究竟,phalcon 3.2.4 ext/kernel/object.c,源码上并没有 zephir_create_instance.constprop.180,继续对 phalcon 框架、zephir 深入研究熟悉,里面用到了 LLVM 相关知识点。


待下篇文章分析 b zend_fetch_class coredump 之前最后一次断点。


Breakpoint 3, 0x00000000006c4db0 in zend_fetch_class ()                                                                                      (gdb)Single stepping until exit from function zend_fetch_class,which has no line number information.0x00007fffe4029273 in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so(gdb) bt#0  0x00007fffe4029273 in zephir_create_instance.constprop.180 ()   from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#1  0x00007fffe4250860 in zim_Phalcon_Di_Service_resolve () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#2  0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff4414650, return_value=0x7fffffff9cf0)    at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026#3  0x00000000006c4388 in zend_call_function ()#4  0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#5  0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()   from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#6  0x00007fffe417eac3 in zim_Phalcon_Di_get () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#7  0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff44145d0, return_value=0x7fffffffa2c0)    at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026#8  0x00000000006c4388 in zend_call_function ()#9  0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#10 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()   from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#11 0x00007fffe410996b in zim_Phalcon_Di_getShared () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#12 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff4414560, return_value=0x7fffffffa8e0)    at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026#13 0x00000000006c4388 in zend_call_function ()#14 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#15 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 ()   from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#16 0x00007fffe4142d4a in zim_Phalcon_Cli_Console_handle () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#17 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff44144f0, return_value=0x7ffff44144d0)    at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026#18 0x000000000074bc14 in ZEND_DO_FCALL_SPEC_HANDLER ()#19 0x000000000070dffb in execute_ex ()#20 0x00007fffeda6d079 in xdebug_execute_ex (execute_data=0x7ffff4414030) at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:1904#21 0x0000000000757767 in zend_execute ()#22 0x00000000006d1b64 in zend_execute_scripts ()---Type <return> to continue, or q <return> to quit---#23 0x0000000000676260 in php_execute_script ()#24 0x00000000007592ef in do_cli ()                                                                                                          #25 0x00000000004332df in main ()(gdb) cContinuing.
Breakpoint 3, 0x00000000006c4db0 in zend_fetch_class ()(gdb)Continuing.
Breakpoint 3, 0x00000000006c4db0 in zend_fetch_class ()(gdb)Continuing.
Program received signal SIGSEGV, Segmentation fault.0x00007fffe402928b in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so(gdb) bt#0 0x00007fffe402928b in zephir_create_instance.constprop.180 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#1 0x00007fffe417e5a6 in zim_Phalcon_Di_get () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#2 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff4414630, return_value=0x7fffffff9650) at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026#3 0x00000000006c4388 in zend_call_function ()#4 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#5 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#6 0x00007fffe410996b in zim_Phalcon_Di_getShared () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#7 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff44145c0, return_value=0x7fffffff9d20) at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026#8 0x00000000006c4388 in zend_call_function ()#9 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#10 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#11 0x00007fffe42d9fd5 in zim_Phalcon_Dispatcher_dispatch () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#12 0x00007fffeda6da35 in xdebug_execute_internal (current_execute_data=0x7ffff4414560, return_value=0x7fffffffa8c0) at /usr/local/src/xdebug-2.7.0RC2/xdebug.c:2026#13 0x00000000006c4388 in zend_call_function ()#14 0x00007fffe401c751 in zephir_call_user_function () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so#15 0x00007fffe401d04a in zephir_call_class_method_aparams.part.35 () from /usr/local/matrix/lib/php/extensions/no-debug-non-zts-20151012/phalcon.so.....
复制代码


作者介绍:


扶墙(企业代号名),目前主要负责贝壳找房新装修业务线客源方向研发以及中心整体技术架构设计工作。


本文转载自公众号贝壳产品技术(ID:gh_9afeb423f390)。


原文链接:


https://mp.weixin.qq.com/s/-VdusyL3d7vpZvNMDMQe_g


2019-09-27 14:373769

评论

发布
暂无评论
发现更多内容

如何降低代码的复杂度?

代码生成器研究

Mural在线白板最全解析!Mural功能|发展历程|替代软件推荐!

彭宏豪95

科技 在线白板 办公软件 在线协作 效率软件

卫龙 x 赛博威 | 为民族品牌数字化建设添砖加瓦

赛博威科技

营销费用管理 赛博威 卫龙

SQL 数据库语句- 创建和管理数据库

小万哥

MySQL 数据库 程序员 sql 后端开发

输出内容价值 | 极客写作训练营

6点无痛早起学习的和尚

代码坏味道

还记得当初自己为什么选择计算机?

代码生成器研究

青椒云云桌面—低配电脑秒变高性能设计神器

青椒云云电脑

桌面云 云桌面 云桌面系统

5 种主要的云电脑解决方案 - 不同之处

青椒云云电脑

云桌面 云电脑 云桌面解决方案

精选21款免费项目管理系统,哪款更适合你?

PingCode

项目管理 项目经理 项目管理软件

行业首家!百度通过DCMM 4级乙方云服务商最高认证,数据管理能力行业领先

百度安全

低代码开发平台有什么优势?

代码生成器研究

直播预告 | 降本增效持续深化,如何找准 FinOps 关键着力点?

小红书技术REDtech

云原生 成本优化 FinOps

2023,“科技无障碍”不谈价值观

脑极体

AI

当我跑越野时,我在想为什么

escray

技术人写作 21 天技术人写作行动营 21 天

HashMap超全源码详解(JDK1.8)

是月月啊2023

Java 面试题

手把手教你使用 RisingWave 流数据库

吴英骏

分布式 rust 流处理 物化视图 数据库设计流程

如何降低代码的复杂度?

代码生成器研究

云桌面系统如何使用?云桌面的优势有哪些?

青椒云云电脑

云桌面 云桌面解决方案 云桌面系统

[译]原生CSS嵌套使用

南城FE

CSS 前端 预处理器 嵌套

低代码开发平台有什么优势?

代码生成器研究

Last Week in Milvus

Zilliz

非结构化数据 Milvus Zilliz AIGC

阿里云数据库MongoDB版助力掌阅平滑上云,撬动数据红利

Geek_2d6073

只要你想,你就能找到一种方法

学渣汪在央企打怪升级

一种典型的负载均衡解决方案

极客罗杰

您距离一个成熟安全的 DevOps 平台,只差一个迁移

极狐GitLab

DevOps gitlab Atlassian Gartner Bamboo

英特尔是如何实现玻璃基板的?

E科讯

云桌面是什么?好用的云桌面推荐?

青椒云云电脑

云桌面 云桌面解决方案

《21 天技术人写作行动营》--工作后最有成就感的一件事

IT蜗壳-Tango

工作以来最有成就感的事(深度思考)

Java 工程师蔡姬

21 天技术人写作行动营 #个人总结 #工作总结 #最有成就感的事 #职场思考

双十一 |顺应平台趋势,在数据中寻找更多生意机会

赛博威科技

数据分析 双十一 电商大促

Wireshark中的http协议包分析

小魏写代码

coredump调试记录 - PHP篇_大前端_扶墙_InfoQ精选文章