Canonical Voices

What Alex Chiang talks about

Posts tagged with 'memory'

alex

Appy polly loggies for the super long delay between episodes, but I finally carved out some time for our exciting dénouement in the memory leak detection series. Past episodes included detection and analysis.

As a gentle reminder, during analysis, we saw the following block of code:

 874                 GSList *dupes = NULL;
 875                 const char *path;
 876 
 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupes");
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 880 #endif
 881                 return NULL;

And we concluded with:

Is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?
We can definitively say that it is not safe to return NULL without doing anything to dupes. We definitely allocated memory, stuck it into dupes, and then threw dupes away. This is our smoking gun.

But there’s a twist! Eagle-eyed reader Dave Jackson (a former colleague of mine from HP, natch) spotted a second leak! It turns out that line 879 was exceptionally leaky during its inception. As Dave points out, the call to g_slist_prepend() passes g_strdup() as an argument. And as the documentation says:

Duplicates a string. If str is NULL it returns NULL. The returned string should be freed with g_free() when no longer needed.

In memory-managed languages like python, the above idiom of passing a function as an argument to another function is quite common. However, one needs to be more careful about doing so in C and C++, taking great care to observe if your function-as-argument allocates memory and returns it. There is no mechanism in the language itself to automatically free memory in the above situation, and thus the call to g_strdup() seems like it also leaks memory. Yowza!

So, what to do about it?

The basic goal here is that we don’t want to throw dupes away. We need to actually do something with it. Here again are the 3 most pertinent lines.

 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupes");
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 881                 return NULL;

Let’s break these lines down.

  1. On line 877, we retrieve the dupes list from the dup_data.found object
  2. Line 878 gets a path to the duplicate wifi access point
  3. Finally, line 879 adds the duplicate access point to the old dupes list
  4. Line 881 throws it all away!

To me, the obvious thing to do is to change the code between lines 879 and 881, so that after we modify the duplicates list, we save it back into the dup_data object. That way, the next time around, the list stored inside of dup_data will have our updated list. Makes sense, right?

As long as you agree with me conceptually (and I hope you do), I’m going to take a quick shortcut and show you the end result of how to store the new list back into the dup_data object. The reason for the shortcut is that we are now deep in the details of how to program using the glib API, and like many powerful APIs, the key is to know which functions are necessary to accomplish your goal. Since this is a memory leak tutorial and not a glib API tutorial, just trust me that the patch hunk will properly store the dupes list back into the dup_data object. And if it’s confusing, as always, read the documentation for g_object_steal_data and g_object_set_data_full.

@@ -706,14 +706,15 @@
 +		GSList *dupes = NULL;
 +		const char *path;
 +
-+		dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupes");
++		dupes = g_object_steal_data (G_OBJECT (dup_data.found), "dupes");
 +		path = nm_object_get_path (NM_OBJECT (ap));
 +		dupes = g_slist_prepend (dupes, g_strdup (path));
++		g_object_set_data_full (G_OBJECT (dup_data.found), "dupes", (gpointer) dupes, (GDestroyNotify) clear_dupes_list);
 +#endif
  		return NULL;
  	}

If the above patch format looks funny to you, it’s because we are changing a patch.

-+		dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupes");
++		dupes = g_object_steal_data (G_OBJECT (dup_data.found), "dupes");

This means the old patch had the line calling g_object_get_data() and the refreshed patch now calls g_object_steal_data() instead. Likewise…

++		g_object_set_data_full (G_OBJECT (dup_data.found), "dupes", (gpointer) dupes, (GDestroyNotify) clear_dupes_list);

The above call to g_object_set_data_full is a brand new line in the new and improved patch.

Totally clear, right? Don’t worry, the more sitting and contemplating of the above you do, the fuller and more awesomer your neckbeard grows. Don’t forget to check it every once in a while for small woodland creatures who may have taken up residence there.

And thus concludes our series on how to detect, analyze, and fix memory leaks. All good? Good.

waiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiit!!!!!!!11!

I can hear the observant readers out there already frantically scratching their necks and getting ready to point out the mistake I made. After all, our newly refreshed patch still contains this line:

 +		dupes = g_slist_prepend (dupes, g_strdup (path));

And as we determined earlier, that’s our incepted memory leak, right? RIGHT!?‽

Not so fast. Take a look at the new line in our updated patch:

++		g_object_set_data_full (G_OBJECT (dup_data.found), "dupes", (gpointer) dupes, (GDestroyNotify) clear_dupes_list);

See that? The last argument to g_object_set_data_full() looks quite interesting indeed. It is in fact, a cleanup function named clear_dupes_list(), which according to the documentation, will be called

when the association is destroyed, either by setting it to a different value or when the object is destroyed.

In other words, when we are ready to get rid of the dup_data.found object, as part of cleaning up that object, we’ll call the clear_dupes_list() function. And what does clear_dupes_list() do, praytell? Why, let me show you!

static void
clear_dupes_list (GSList *list)
{
	g_slist_foreach (list, (GFunc) g_free, NULL);
	g_slist_free (list);
}

Trés interesante! You can see that we iterate across the dupes list, and call g_free on each of the strings we did a g_strdup() on before. So there wasn’t an inception leak after all. Tricky tricky.

A quick digression is warranted here. Contrary to popular belief, it is possible to write object oriented code in plain old C, with inheritance, method overrides, and even some level of “automatic” memory management. You don’t need to use C++ or python or whatever the web programmers are using these days. It’s just that in C, you build the OO features you want yourself, using primitives such as structs and function pointers and smart interface design.

Notice above we have specified that whenever the dup_data object is destroyed, it will free the memory that was stuffed into it. Yes, we had to specify the cleanup function manually, but we are thinking of our data structures in terms of objects.

In fact, the fancy features of many dynamic languages are implemented just this way, with the language keeping track of your objects for you, allocating them when you need, and freeing them when you’re done with them.

Because at the end of the day, it is decidedly not turtles all the way down to the CPU. When you touch memory in in python or ruby or javascript, I guarantee that something is doing the bookkeeping on your memory, and since CPUs only understand assembly language, and C is really just pretty assembly, you now have a decent idea of how those fancy languages actually manage memory on your behalf.

And finally now that you’ve seen just how tedious and verbose it is to track all this memory, it should no longer be a surprise to you that most fancy languages are slower than C. Paperwork. It’s always paperwork.

And here we come to the upshot, which is, tracking down memory leaks can be slow and time consuming and trickier than first imagined (sorry for the early head fake). But with the judicious application of science and taking good field notes, it’s ultimately just like putting a delicious pork butt in the slow cooker for 24 hours. Worth the wait, worth the effort, and it has a delicious smoky sweet payoff.

Happy hunting!


kalua pork + homemade mayo and cabbage

Read more
alex

In our last exciting episode, we learned how to capture a valgrind log. Today we’re going to take the next step and learn how to actually use it to debug memory leaks.

There are a few prerequisites:

  1. know C. If you don’t know it, go read the C programming language which is often referred to as K&R C. Be sure to understand the sections on pointers, and after you do, come back to my blog. See you in 2 weeks!
  2. a nice supply of your favorite beverages and snacks. I prefer coffee and bacon, myself. Get ready because you’re about to read an epic 2276 word blog entry.

That’s it. Ok, ready? Let’s go!

navigate the valgrind log
Open the valgrind log that you collected. If you don’t have one, you can grab one that I’ve already collected. Take a deep breath. It looks scary but it’s not so bad. I like to skip straight to the good part near the bottom. Search the file for “LEAK SUMMARY”. You’ll see something like:

==13124== LEAK SUMMARY:
==13124==    definitely lost: 916,130 bytes in 37,528 blocks
==13124==    indirectly lost: 531,034 bytes in 12,735 blocks
==13124==      possibly lost: 82,297 bytes in 891 blocks
==13124==    still reachable: 2,578,733 bytes in 42,856 blocks
==13124==         suppressed: 0 bytes in 0 blocks
==13124== Reachable blocks (those to which a pointer was found) are not shown.
==13124== To see them, rerun with: --leak-check=full --show-reachable=yes

You can see that valgrind thinks we’ve definitely leaked some memory. So let’s go figure out what leaked.

Valgrind lists all the leaks, in order from smallest to largest. The leaks are also categorized as “possibly” or “definitely”. We’ll want to focus on “definitely” for now. Right above the summary, you’ll see the worst, definite leak:

==13124== 317,347 (77,312 direct, 240,035 indirect) bytes in 4,832 blocks are definitely lost in loss record 10,353 of 10,353
==13124==    at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13124==    by 0x74E3A78: g_malloc (gmem.c:159)
==13124==    by 0x74F6CA2: g_slice_alloc (gslice.c:1003)
==13124==    by 0x74F7ABD: g_slist_prepend (gslist.c:265)
==13124==    by 0x4275A4: get_menu_item_for_ap (applet-device-wifi.c:879)
==13124==    by 0x427ACE: wireless_add_menu_item (applet-device-wifi.c:1138)
==13124==    by 0x41815B: nma_menu_show_cb (applet.c:1643)
==13124==    by 0x4189EC: applet_update_indicator_menu (applet.c:2218)
==13124==    by 0x74DDD52: g_main_context_dispatch (gmain.c:2539)
==13124==    by 0x74DE09F: g_main_context_iterate.isra.23 (gmain.c:3146)
==13124==    by 0x74DE499: g_main_loop_run (gmain.c:3340)
==13124==    by 0x414266: main (main.c:106)

Wow, we lost 300K of memory in just a few hours. Now imagine if you don’t reboot your laptop for a week. Yeah, that’s not so good. Time for a coffee and bacon break, the next part is about to get fun.

read the stack trace
What you saw above is a stack trace, and it’s printed chronologically “backwards”. In this example, malloc() was called by g_malloc(), which was called by g_slice_alloc(), which in turn was called by g_slist_prepend(), which itself was called by get_menu_item_for_ap() and so forth. The first function ever called was main(), which should hopefully make sense.

At this point, we need to use a little bit of extra knowledge to understand what is happening. The first function, main() is in our program, nm-applet. That’s fairly easy to understand. However, the next few functions that begin with g_main_ don’t actually live inside nm-applet. They are part of glib, which is a library that nm-applet depends on. I happened to have just known this off the top of my head, but if you’re ever unsure, you can just google for the function name. After searching, we can see that those functions are in glib, and while there is some magic that is happening, we can blissfully ignore it because we see that we soon jump back into nm-applet code, starting with applet_update_indicator_menu().

a quick side note
Many Linux programs will have a stack trace similar to the above. The program starts off in its own main(), but will call various other libraries on your system, such as glib, and then jump back to itself. What’s going on? Well, glib provides a feature known as a “main loop” which is used by the program to look for inputs and events, and then react to them. It’s a common programming paradigm, and rather than have every application in the world write their own main loop, it’s easier if everyone just uses the one provided by glib.

The other observation is to note how the function names appear prominently in the stack trace. Pundits wryly say that naming things is one of the hardest things in computer science, and I completely agree. So take care when naming your functions, because people other than you will definitely see them and need to understand them!

Alright, let’s get back to the stack trace. We can see a few functions that look like they belong to nm-applet, based on their names and their associated filenames. For example, the function wireless_add_menu_item() is in the file applet-device-wifi.c on line 1138. Now you see why we wanted symbols from the last episode. Without the debug symbols, all we would have seen would have been a bunch of useless ??? and we’d be gnashing our teeth and wishing for more bacon right now.

Finally, we see a few more g_* functions, which means we’re back in the memory allocation functions provided by glib. It’s important to understand at this point that g_malloc() is not the memory leak. g_malloc() is simply doing whatever nm-applet asks it to do, which is to allocate memory. The leak is highly likely to be in nm-applet losing a reference to the pointer returned by g_malloc().

What does it mean?
Now we’re ready to start the real debugging. We know approximately where we are leaking memory inside nm-applet: get_menu_item_for_ap() which is the last function before calling the g_* memory functions. Time to top off on coffee because we’re about to get our hands dirty.

reading the source
The whole point of open source is being able to read the source. Are you as excited as I am? I know you are!

First, let’s get the source to nm-applet. Assuming you are using Ubuntu and you are using 12.04, you’d simply say:

$ cd Projects
$ mkdir network-manager-gnome
$ cd network-manager-gnome
$ apt-get source network-manager-gnome
$ cd network-manager-applet-0.9.4.1

Woo hoo! That wasn’t hard, right?

side note #2
Contrary to popular belief, reading code is harder than writing code. When you write code, you are transmitting the thoughts of your messy brain into an editor, and as long as it kinda works, you’re happy. When you read code, now you’re faced with the problem of trying to understand exactly what the previous messy brain wrote down and making sense of it. Depending on how messy that previous brain was, you may have real trouble understanding the code. This is where pencil and paper and plenty of coffee come into play, where you literally trace through what the program is doing to try and understand it.

Luckily there are at least a few tools to help you do this. My favorite tools are cscope and ctags, which help me to rapidly understand the skeleton of a program and navigate around its complex structure.

Assuming you are in the network-manager-applet-0.9.4.1 source tree:

$ apt-get install cscope ctags
$ cscope -bqR
$ ctags -R
$ cscope -dp4


You are now presented with a menu. Use control-n and control-p to navigate input fields at the bottom. Try navigating to “Find this C symbol:” and then type in get_menu_item_for_ap, and press enter. The search results are returned, and you can press ’0′ or ’1′ to jump to either of the locations where the function is referenced. You can also press the space bar to see the final search result. Play around with some of the other search types and see what happens. I’ll talk about ctags in a bit.

Alrighty, let’s go looking for our suspicious nm-applet function. Start up cscope as described above. Navigate to “Find this global definition:” and search for get_menu_item_for_ap. cscope should just directly put you in the right spot.

Based on our stack trace, it looks like we’re doing something suspicious on line 879, so let’s go see what it means.

 869         if (dup_data.found) {
 870 #ifndef ENABLE_INDICATOR
 871                 nm_network_menu_item_best_strength (dup_data.found, nm_acce
 872                 nm_network_menu_item_add_dupe (dup_data.found, ap);
 873 #else
 874                 GSList *dupes = NULL;
 875                 const char *path;
 876 
 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 880 #endif
 881                 return NULL;
 882         }

Cool, we can now see where the source code is matching up with the valgrind log.

Let’s start doing some analysis. The first thing to note are the #ifdef blocks on lines 870, 873, and 880. You should know that ENABLE_INDICATOR is defined, meaning we do not execute the code in lines 871 and 872. Instead, we do lines 874 to 879, and then we do 881. Why do we do 881 if it is after the #endif? That’s because we fell off the end of the #ifdef block, and then we do whatever is next, after we fall off, namely returning NULL.

Don’t worry, I don’t know what’s going on yet, either. Time for a refill!

Back? Great. Alright, valgrind says that we’re doing something funky with g_slist_prepend().

==13124==    by 0x74F7ABD: g_slist_prepend (gslist.c:265)

And our relevant code is:

 874                 GSList *dupes = NULL;
 875                 const char *path;
 876 
 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));
 880 #endif
 881                 return NULL;

We can see that we declare the pointer *dupes on line 874, but we don’t do anything with it. Then, we assign something to it on line 877. Then, we assign something to it again on line 879. Finally, we end up not doing anything with *dupes at all, and just return NULL on line 881.

This definitely seems weird and worth a second glance. At this point, I’m asking myself the following questions:

  • did g_object_get_data() allocate memory?
  • did g_slist_prepend() allocate memory?
  • are we overwriting *dupes on line 879? that might be a leak.
  • is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?

Let’s take them in order.

did g_object_get_data() allocate memory?
g_object_get_data has online documentation, so that’s our first stop. The documentation says:

Returns :
the data if found, or NULL if no such data exists. [transfer none]

Since I am not 100% familiar with glib terminology, I guess [transfer none] means that g_object_get_data() doesn’t actually allocate memory on its own. But let’s be 100% sure. Time to grab the glib source and find out for ourselves.

$ apt-get source libglib2.0-0
$ cd glib2.0-2.32.1
$ cscope -bqR
$ ctags -R
$ cscope -dp4
search for global definition of g_object_get_data

Pretty simple function.

3208 gpointer
3209 g_object_get_data (GObject     *object,
3210                    const gchar *key)
3211 {
3212   g_return_val_if_fail (G_IS_OBJECT (object), NULL);
3213   g_return_val_if_fail (key != NULL, NULL);
3214 
3215   return g_datalist_get_data (&object->qdata, key);
3216 }

Except I have no idea what g_datalist_get_data() does. Maybe that guy is allocating memory. Now I’ll use ctags to make my life easier. In vim, put your cursor over the “g” in “g_datalist_get_data” and then press control-]. This will “step into” the function. Magic!

 844 gpointer
 845 g_datalist_get_data (GData       **datalist,
 846                      const gchar *key)
 847 {
 848   gpointer res = NULL; 
 ... 
 856   d = G_DATALIST_GET_POINTER (datalist);
 ...
 859       data = d->data;
 860       data_end = data + d->len;
 861       while (data < data_end)
 862         {
 863           if (strcmp (g_quark_to_string (data->key), key) == 0)
 864             {
 865               res = data->data;
 866               break;
 867             }
 868           data++;
 869         }
 ... 
 874   return res;
 875 }

This is a pretty simple loop, walking through an existing list of pointers which have already been allocated somewhere else, starting on line 861. We do our comparison on line 863, and if we get a match, we assign whatever we found to res on line 865. Note that all we are doing here is a simple assignment. We are not allocating any memory!

Finally, we return our pointer on line 874. Press control-t in vim to pop back to your last location.

Now we know for sure that g_object_get_data() and g_datalist_get_data() do not allocate any memory at all, so there can be no possibility of a leak here. Let’s try the next function.

did g_slist_prepend() allocate memory?
First, read the documentation, which says:

The return value is the new start of the list, which may have changed, so make sure you store the new value.

This probably means it allocates memory for us, but let’s double-check just to be sure. Back to cscope!

 259 GSList*
 260 g_slist_prepend (GSList   *list,
 261                  gpointer  data)
 262 {
 263   GSList *new_list;
 264 
 265   new_list = _g_slist_alloc ();
 266   new_list->data = data;
 267   new_list->next = list;
 268 
 269   return new_list;
 270 }

Ah ha! Look at line 265. We are 100% definitely allocating memory, and returning it on line 269. Things are looking up! Let’s keep going with our questions.

are we overwriting *dupes on line 879? that might be a leak.
Remember:

 877                 dupes = g_object_get_data (G_OBJECT (dup_data.found), "dupe
 878                 path = nm_object_get_path (NM_OBJECT (ap));
 879                 dupes = g_slist_prepend (dupes, g_strdup (path));

We’ve already proven to ourselves that line 877 doesn’t allocate any memory. It just sets dupes to some value. However, on line 879, we do allocate memory. It is equivalent to this code:

  int *dupes;
  dupes = 0x12345678;
  dupes = malloc(128);

So simply setting dupes to the return value of g_object_get_data() and later overwriting it with the return value of malloc() does not inherently cause a leak.

By way of counter-example, the below code is a memory leak:

  int *dupes;
  dupes = malloc(64);
  dupes = malloc(128);    /* leak! */

The above essentially illustrates the scenario I was worried about. I was worried that g_object_get_data() allocated memory, and then g_slist_prepend() also allocated memory which would have been a leak because the first value of dupes got scribbled over by the second value. My worry turned out to be incorrect, but that is the type of detective work you have to think about.

As a clearer example of why the above is a leak, consider the next snippet:

  int *dupes1, *dupes2;
  dupes1 = malloc(64);     /* ok */
  dupes2 = malloc(128);    /* ok */
  dupes1 = dupes2;         /* leak! */

First we allocate dupes1. Then allocate dupes2. Finally, we set dupes1 = dupes2, and now we have a leak. No one knows what the old value of dupes1 was, because we scribbled over it, and it is gone forever.

is it safe to just return NULL without doing anything to dupes? maybe that’s our leak?
We can definitively say that it is not safe to return NULL without doing anything to dupes. We definitely allocated memory, stuck it into dupes, and then threw dupes away. This is our smoking gun.

Next time, we’ll see how to actually fix the problem.

Read more