Starting with Strace

What is strace?

Strace is a “system call trace” program. It attaches to a process and tracks system calls and signals made to and from it (and possibly it’s children). There are limitations to strace, some of which are detailed below. However, strace can be a very valuable tool for determining the root cause of many issues. This post only covers very basic usage of strace. The example used here can be done using jailshell.

What is a process?

For the purposes of this tutorial, a process is a unique instance of program being run, identified on the system by a unique process identifier (pid). For more information see http://www.linfo.org/process.html and http://en.wikipedia.org/wiki/Process_identifier

What is a system call?

A system call is how a program requests that the operating system kernel perform a task for it. There are quite a few “normal” tasks that the operating system kernel does for programs regularly, including extremely common file operations like reading and writing. More information regarding system calls can be found at: http://en.wikipedia.org/wiki/System_call

What information can strace capture? Not capture?

Strace can tell you

what files and network connections were opened, closed, or attempted
what was read from and written to from said files and connections

Strace cannot tell you what is happening inside the process. It is not a debugger and is not aware of the variables used inside the process.

What do I have to know before I can use strace?

Strace is a power tool in your troubleshooting bag. But it’s not the most basic tool. General troubleshooting (checking log files, researching error messages, etc) should be done before attempting to solve a problem with strace. Before using strace, you’ll need to be able to reliably reproduce the issue so that you can catch it with strace. You’ll also need to determine what process to strace (covered later).

When am I not able to use strace?

In addition to there being times when strace will not be helpful, there are a few scenarios when you can not use strace. While most normal processes can be attached to by strace, there are some exceptions.

You cannot attach to a process in these scenarios:

when a debugger or other tracing program is already attached
when you do not own the process, unless you are running the strace as root
when the system has special protections preventing strace from being used to gather information

Resource considerations

Output files created by strace files do get very big very quickly, but more importantly are the numerous writes that strace output produces. This i/o hit is not mitigated (and is frequently made worse) by outputting to the terminal instead of a file.

Before you start to strace

Ensuring strace is installed

You can determine if strace is installed by typing:

strace -V

If strace is installed, you’ll see something like: strace — version 4.5.19. If it’s not installed, you’ll see something like bash: strace: command not found. Install strace with:

yum install strace

This command does require root access, so if you do not have root, you will need to request that your web hosting provider install strace for you.

Finding the right process

The ps command is very useful if you need to attach strace to a long running existing process (like a backup that you think may have stalled). The important column is the PID column. It will frequently be either the first or second column, depending on which options you use.

What output you can see with the ps command is greatly dependent on your permissions for the server. A user with jailshell will frequently see very little:
cptest@cptest.test [~]# ps uxfww USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND cptest 1 0.0 0.1 108300 1780 ? S 10:01 0:00 jailshell cptest 10 0.0 0.1 110192 1036 ? R+ 10:02 0:00 ps wwwfux cptest@cptest.test [~]#

The root user will see quite a bit more (part of the results from the command ps uxfww):
root 1042 0.0 0.1 249088 1992 ? Sl Aug27 0:02 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5 root 1120 0.0 0.0 4080 624 ? Ss Aug27 0:00 /usr/sbin/acpid root 1192 0.0 0.1 66616 1220 ? Ss Aug27 0:00 /usr/sbin/sshd root 10424 0.0 0.3 104484 4068 ? Ss 09:59 0:00 _ sshd: root@pts/0 root 10432 0.0 0.1 108296 1828 pts/0 Ss 09:59 0:00 _ -bash root 10486 0.0 0.1 110340 1132 pts/0 R+ 10:00 0:00 _ ps wwwfux root 1204 0.0 0.1 108296 1544 ? S Aug27 0:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --pid-file=/var/lib/mysql/nt13188.product.cpanel.vm.pid root 1355 0.0 0.1 19608 1100 ? Ss Aug27 0:01 /usr/sbin/dovecot root 1368 0.0 0.1 13444 1280 ? S Aug27 0:00 _ dovecot/log root 1372 0.0 0.2 16428 2084 ? S Aug27 0:01 _ dovecot/config root 1449 0.0 5.0 165480 51148 ? Ss Aug27 2:09 /usr/local/cpanel/3rdparty/perl/514/bin/spamd -d --allowed-ips=127.0.0.1 --pidfile=/var/run/spamd.pid --max-children=3 --max-spare=1 root 1557 0.0 4.8 165480 49116 ? S Aug27 0:00 _ spamd child root 1471 0.0 0.0 110316 948 ? Ss Aug27 0:00 /usr/sbin/abrtd root 1483 0.0 0.5 73712 5176 ? Ss Aug27 0:33 /usr/local/apache/bin/httpd -k start -DSSL

A jailshell user will not be able to see processes that it did not start in the terminal they are running ps from. This includes php processes and cPanel login sessions. A normal/bash shell user will be able to see all their processes – and processes running as other users.

If you are attaching strace as the process starts, you will not need to use ps to determine the pid. If you only have jail shell access, you will be limited to using strace on processes you can start yourself (from the jail shell).

Starting strace and basic options

There are two ways to attach a strace to a process: you can start the process with strace attached or you can attach strace to an already running process.

Strace a process start to finish

Staring a process with strace already attached means starting strace and having strace start the process. You tell strace you want it to start the process by adding the command you want to strace as the last argument of strace. This sounds complicated, but is easy. For example, if you wanted to strace the script ./chdir.sh start to finish, you would use this command:

# strace ./chdir.sh

will print the strace output generated when running the shell script chdir.sh to the screen. Note that the command does run normally – any actions normally performed by the script being invoked will occur normally. Strace logs them, it does not stop them from occurring.

Attaching to an already running process: -p PID

If instead, you wish to attach to a long running process (like, say, a backup) you would use the similar command:

# strace -p 12345

In this case, the orange text is the -p (pid) option followed by the target process’s process ID. To detach from a strace, use Ctrl-D.

Saving strace output to a file: -o filename

In most cases, you’ll want to save your strace output to a file. This is frequently faster than outputting to the terminal, and it gives you the opportunity to easily search the output using tools like grep. For example to save the output of the above command to a file called “debug.strace” in the current directory, you would use the command:

# strace -o debug.strace -p 12345

You can use any path that you have access to. The file does not need to already exist. If it does, strace will overwrite it. If you use a path you don’t have access to, you’ll get a Permission denied error and strace won’t attach:

mary@mary.test [~/public_html]# strace -f -o /root/perl-quiet2.strace ./open-silent.pl strace: can't fopen '/root/perl-quiet2.strace': Permission denied mary@mary.test [~/public_html]#

Following forks: -f and -ff

Many processes create child processes to handle some of the work, and you will almost certainly need to tell strace to log them as well. This is recursive – if the child process creates a child, that “grandchild” will also be included in the strace. This is called “following forks”, and there are two ways to do so with strace. You can have strace merge all the output together with -f, or you can use -ff with -o and have strace write one file for each process (-ff).

Targeting in time: -t, -tt, -ttt, -T

Timestamps are extremely useful. You may need to order events in a strace file and a log file or want to know how long a system call took. The four “t” options of strace can help with this:

switchwhat it doesexample

-t	add a timestamp to each line	12:39:18 brk(0) = 0x8da000
-tt	add a timestamp with microseconds	12:39:34.601873 brk(0) = 0x18cc000
-ttt	add a timestamp with microseconds and the number of seconds since the epoch	1401557981.377712 brk(0) = 0x1509000
-T	add the time spent in the system call	using both -ttt and -T: 1401558007.457779 brk(0) = 0x1d59000 <0.000012>

Verbosity and string size: -v and -s

Neither of these options is used in the example, but they are worth mentioning because they can be very useful.

The string size option (-s) lets you change how much of a string being read or written is shown in the strace output. The default string length is 32. There are a few things to note:

Filenames are not considered strings and are always printed in full
strace will still tell you how long the string was, even if it doesn’t show you the whole thing
string length is not quite the same as the number of characters (because string length includes control characters like the newline and some UTF-8 characters are more than 1 unit long by strace’s counting method)

The -v (verbose) switch tells strace to print the more information about environment for many calls. This is very noticeable on the “stat” call. For instance, the first line of a strace without the verbose option looks like this:
1489 execve("./open-silent.pl", ["./open-silent.pl"], [/* 31 vars */]) = 0
with the verbose option, I see:
execve("./open-silent.pl", ["./open-silent.pl"], ["HOSTNAME=nt12199.product.cpanel."..., "TERM=xterm-256color", "SHELL=/bin/bash", "HISTSIZE=1000", "SSH_CLIENT=10.1.7.152 56078 22", "PERL5LIB=/home/mary/perl5/lib/pe"..., "OLDPWD=/home/mary", "PERL_MB_OPT=--install_base "/hom"..., "SSH_TTY=/dev/pts/0", "USER=mary", "LS_COLORS=rs=0:di=01;34:ln=01;36"..., "MAIL=/var/spool/mail/mary", "PATH=/usr/local/jdk/bin:/home/ma"..., "PWD=/home/mary/public_html", "JAVA_HOME=/usr/local/jdk", "EDITOR=pico", "LANG=en_US.UTF-8", "HISTCONTROL=ignoredups", "SHLVL=1", "HOME=/home/mary", "LS_OPTIONS=--color=tty -F -a -b "..., "PERL_LOCAL_LIB_ROOT=/home/mary/p"..., "LOGNAME=mary", "VISUAL=pico", "CVS_RSH=ssh", "CLASSPATH=.:/usr/local/jdk/lib/c"..., "SSH_CONNECTION=10.1.7.152 56078 "..., "LESSOPEN=|/usr/bin/lesspipe.sh %"..., "PERL_MM_OPT=INSTALL_BASE=/home/m"..., "G_BROKEN_FILENAMES=1", "_=/usr/bin/strace"]) = 0

Reading strace output

Using strace to document the activity of a very simple script like this one:
#!/bin/bash cd /home/
produces a 6.3K file when using a invoking strace with this command:
strace -o changedir ./chdir.sh

The script is functionally useless, but stracing it does produce some output for us to examine. Please do try this at home; it’s the best way to learn. 🙂 When you do, you’ll see that the first line in the strace is:

execve("./chdir.sh", ["./chdir.sh"], [/* 22 vars */]) = 0

You’ll see a line similar to this (with the execve call) when you start strace with a command. You will not see it if you attach strace to an already running process. The majority of the file contains information that we don’t care about as beginners, but there are a few items of interest. 🙂

File Handles

The second line of interest is:

open("./chdir.sh", O_RDONLY) = 3

This is the operating system opening the script we’re running. It’s using the open system call to open the file ./chdir.sh in readonly mode. It’s assigned it file handle #3. After it opens the file, it reads it. Strace does not give you the name of the file ever again – it only tells you what file handle it’s reading from:

read(3, "#!/bin/bashnncd /home/nn", 80) = 24

The above line tells you that the system read 24 bytes from file handle #3. It even helpfully tells you the content of those 24 bytes, because it’s short. Normally the entire amount read by the system will not be included in the strace – only the first small amount.

To determine what that file is, you have to scroll up and see what the last file that was opened successfully with that file handle. Linux does reuse file handles, so it’s not uncommon to see something like:

open("./chdir.sh", O_RDONLY) = 3 read(3, "#!/bin/bashnncd /home/nn", 80) = 24 close(3) open("./read.sh", O_RDONLY) = 3 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffbfda78c0) = -1 ENOTTY (Inappropriate ioctl for device)

In this case, the error is not important – the kernel is simply telling the program that the file ./read.sh is not a terminal. You will see this frequently in strace output and can ignore it. The important detail to remember here is that file handles are re-used, and it’s important to read up from the line your interested in to find what file is currently being referenced by the file handle.

File handles 1 and 2

File handles 1 and 2 are special. File handle 1 is standard input and file handle 2 is standard output. Standard error is another special file handle that, if used, frequently takes up file handle 3. A tutorial on standard input, output, and error can be found at: http://www.linfo.org/stdio.html.

Manual pages (getting more info)

The linux manual pages are divided into sections. Section 2 is for system calls, like the ones you will see in strace output. Each system call has it’s own manual page that includes all the things that are allowed to go as arguments (between the parentheses) and all the return values (things that can appear after the “=” sign).

At this stage, the manual page is mostly useful for looking up things that appear in the strace output. For instance, an example used later will contain the following output:

open("./chdir.sh", O_RDONLY) = 3

If you want to know what O_RDONLY means, you can either search the web for “man open linux” and use your browsers “find on page” function to find the information regarding O_RDONLY or you can use man 2 open and (by default) the “/” key to search inside the manpage. The first instance of O_RDONLY on the page says:

The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request opening the file read-only, write-only, or read/write, respectively.

We now know that the file was opened read only. Lets say that instead of the above, output, we saw:

open("./chdir.sh", O_RDONLY) = EACCES

We then search on the manual page for EACCES and find:

The requested access to the file is not allowed, or search permission is denied for one of the directories in the path prefix of pathname, or the file did not exist yet and write access to the parent directory is not allowed. (See also path_resolution(7).

This is a little verbose, but it basically says that this is an access error – like permission denied.

Example: Permission Issues

Permission issues are very common on linux servers, and can sometimes be hard to track down. Strace can be a helpful tool for many permissions issues. With cPanel, you’ll be using strace on a lot of perl scripts. It’s also likely that there will be multiple errors in one script. This example shows two permissions and a missing file issue.

Create a file in your working directory called “open-warn.pl” and copy the contents below into it. Ensure that it is set to be executable (chmod +x open-warn.pl)
Create a file called “test-read” that contains a short string, like “This is a test file”, and ensure that it has 0 permissions.
Create a directory called “content”, and copy the test-read file into it.
Ensure that the content folder has 0 permissions, and that the test-read file inside the content folder it is set to 644.
Ensure that there is no file named “test-read-missing” in the working directory.
Run strace on open-warn.pl, following forks and outputting to the file perl-test.strace

The file for step 1:
[~/public_html]# cat open-warn.pl #!/usr/bin/perl

print “Testing bad file permissionsn”;
$filename = “test-read”;
open( FILE, “< $filename” ) or warn “Can’t open $filename : $!”;
close FILE;

print “Testing bad folder permissionsn”;
$filename = “content/test-read”;
open( FILE, “< $filename” ) or warn “Can’t open $filename : $!”;
close FILE;

print “Testing missing filen”;
$filename = “test-read-missing”;
open( FILE, “< $filename” ) or warn “Can’t open $filename : $!”;
close FILE;

The “n” at the end of each print statement tells perl to insert a newline. It also tells perl to flush the print buffer (print right now). Without them, all the error messages would print above all the print statements. (Go ahead and try it)

The output from step 5:
[~/public_html]# strace -f -o perl-test.strace ./open-warn.pl Testing bad file permissions Can't open test-read : Permission denied at ./open-warn.pl line 5. Testing bad folder permissions Can't open content/test-read : Permission denied at ./open-warn.pl line 10. Testing missing file Can't open test-read-missing : No such file or directory at ./open-warn.pl line 15. [~/public_html]#

So we have two “Permission Denied” and one “No such file” error. We can grep for either EACCES or “Permission denied” in the strace and find the errors quickly and easily:

[~/public_html]# grep EACCES ../perl-test.strace 8817 open("test-read", O_RDONLY) = -1 EACCES (Permission denied) 8817 open("content/test-read", O_RDONLY) = -1 EACCES (Permission denied)

When we grep for ENOENT or “No such file”, however, we get quite a few more. Below I have two examples of pulling out some of the lines I know I don’t need. Unfortunately, determining what you can safely exclude is highly context dependent.

[~/public_html]# grep ENOENT ../perl-test.strace | grep -v "/usr/" 8817 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) 8817 stat("/home/mary/perl5/lib/perl5/5.10.1/x86_64-linux-thread-multi", 0x7fff625b1400) = -1 ENOENT (No such file or directory) 8817 stat("/home/mary/perl5/lib/perl5/5.10.1", 0x7fff625b1400) = -1 ENOENT (No such file or directory) 8817 stat("/home/mary/perl5/lib/perl5/x86_64-linux-thread-multi", 0x7fff625b1400) = -1 ENOENT (No such file or directory) 8817 stat("/home/mary/perl5/lib/perl5/5.10.0", 0x7fff625b1400) = -1 ENOENT (No such file or directory) 8817 open("test-read-missing", O_RDONLY) = -1 ENOENT (No such file or directory)
or
[~/public_html]# grep ENOENT ../perl-test.strace | grep -v "perl5" 8817 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) 8817 open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) 8817 open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) 8817 open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) 8817 open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) 8817 open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) 8817 open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) 8817 open("test-read-missing", O_RDONLY) = -1 ENOENT (No such file or directory)

Either way, it’s the last line that gives us what we’re looking for.

Now, this was a little easy – we knew what to look for because the error message told us. Most of the time, you’ll be using strace because the error messages are not so helpful. Lets modify our open-warn.pl program and call it open-silent.pl:
[~/public_html]# cat open-silent.pl #!/usr/bin/perl

print “Testing bad file permissions”;
$filename = “test-read”;
open( FILE, “< $filename” ) or warn “Can’t open an important file”;
close FILE;

print “Testing bad folder permissions”;
$filename = “content/test-read”;
open( FILE, “< $filename” ) or warn “Can’t open an important file”;
close FILE;

print “Testing missing file”;
$filename = “test-read-missing”;
open( FILE, “< $filename” ) or warn “Can’t open an important file”;
close FILE;

Now when we run strace we get output like this:
[~/public_html]# strace -f -o perl-quiet.strace ./open-silent.pl Can't open an important file at ./open-silent.pl line 5. Can't open an important file at ./open-silent.pl line 10. Can't open an important file at ./open-silent.pl line 15. Testing bad file permissionsTesting bad folder permissionsTesting missing [~/public_html]#

Now, it still gives you some hints – like the line number. But every single one of those lines reads:
open( FILE, "< $filename" ) or warn "Can't open an important file";

If the program was long and complicated, we’d have no way of knowing if it was the same file that was being attempted multiple times or if there were multiple files. Now the strace file tells us something useful. I would start by checking for permission errors, since they are the most common reason that a file can’t be opened:
[~/public_html]# grep EACCES perl-quiet.strace 1489 open("test-read", O_RDONLY) = -1 EACCES (Permission denied) 1489 open("content/test-read", O_RDONLY) = -1 EACCES (Permission denied)

That gives us two of our three errors. Honestly, if I were debugging for real, at this point I would just check all the permissions on public_html and it’s subfolders and files. Because permission issues usually come in large batches on a webserver. 🙂

Once I’ve fixed the permission issues on test-read and content, I can run strace again. Now I only have one error message. I’m going to check for permission issues again, because sometimes when you fix one permission issue (especially on a directory) it uncovers more permission issues. Unfortunately for me, there are no more permission issues:

[~/public_html]# grep EACCES perl-quiet2.strace [~/public_html]#

So now I have to think of what else might cause a file not to be opened. Maybe the file just isn’t there. So I grep for ENOENT and man there’s a lot of output. I always start at the bottom – especially if the error stops the program from running to completion. In this case, the last line is:

1595 open("test-read-missing", O_RDONLY) = -1 ENOENT (No such file or directory)

Now I know that a program related file is missing. Depending on what my actual application is depends on how I fix this. In this case, I’m just going to touch the file to create it. Now when I run the script (./open-silent.pl) it does not output any error messages. 🙂