Canonical Voices

Posts tagged with 'ubuntu touch'

abeato

So there I was. I had to use a proprietary library, for which I had no sources and no real hope of support from its creators. I built my program against it, ran it, and got a segmentation fault: an exception that seemed to happen inside that insidious library, which was of course stripped of all debugging information. I scratched my head, changed my code, checked traces, tried valgrind, strace, and other debugging tools, but found no obvious error. Finally, I accepted that I had to dig deeper and do some serious debugging of the library’s assembly code with gdb. The rest of the post is dedicated to the steps I followed to find out what was happening inside the wily proprietary library that we will call libProprietary. Prerequisites for this article are some knowledge of gdb and of the ARM architecture.

Some background on the task I was doing: I am a Canonical employee who works as a developer on Ubuntu for Phones. In most, if not all, phones, the BSP code is not 100% open and we have to use proprietary libraries built for Android. Therefore, these libraries use bionic, Android’s libc implementation. As we want to call them from binaries compiled against glibc, we resort to libhybris, an ingenious library that is able to load and call libraries compiled against bionic while the rest of the process uses glibc. This will turn out to be critical in this debugging session. Note also that we are debugging 32-bit ARM binaries here.

The Debugging Session

To start, I made sure I had installed the debugging symbols for glibc and other libraries, and started to debug by using gdb in the usual way:

$ gdb myprogram
GNU gdb (Ubuntu 7.9-1ubuntu1) 7.9
...
Starting program: myprogram
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
[New Thread 0xf49de460 (LWP 7101)]
[New Thread 0xf31de460 (LWP 7104)]
[New Thread 0xf39de460 (LWP 7103)]
[New Thread 0xf41de460 (LWP 7102)]
[New Thread 0xf51de460 (LWP 7100)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf49de460 (LWP 7101)]
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xf520bd06 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info proc mappings
process 7097
Mapped address spaces:

	Start Addr   End Addr       Size     Offset objfile
	   0x10000    0x17000     0x7000        0x0 /usr/bin/myprogram
	...
	0xf41e0000 0xf49df000   0x7ff000        0x0 [stack:7101]
	...
	0xf51f6000 0xf5221000    0x2b000        0x0 /android/system/lib/libProprietary.so
	0xf5221000 0xf5222000     0x1000        0x0 
	0xf5222000 0xf5224000     0x2000    0x2b000 /android/system/lib/libProprietary.so
	0xf5224000 0xf5225000     0x1000    0x2d000 /android/system/lib/libProprietary.so
	...
(gdb)

We can see here that we get the promised crash. After that I execute a couple of gdb commands to see the backtrace and the part of the process address space that will be of interest in the following discussion. The backtrace shows that a segmentation violation happened when the CPU tried to execute instructions at address zero, and we can see by checking the process mappings that the previous frame lives inside the text segment of libProprietary.so. There is no backtrace beyond that point, but that should come as no surprise: there is no DWARF information in libProprietary, and use of the frame pointer is quite commonly optimized away these days.

After this I tried to get a bit more information on the CPU state when the crash happened:

(gdb) info reg
r0             0x0	0
r1             0x0	0
r2             0x0	0
r3             0x9	9
r4             0x0	0
r5             0x0	0
r6             0x0	0
r7             0x0	0
r8             0x0	0
r9             0x0	0
r10            0x0	0
r11            0x0	0
r12            0xffffffff	4294967295
sp             0xf49dde70	0xf49dde70
lr             0xf520bd07	-182403833
pc             0x0	0x0
cpsr           0x60000010	1610612752
(gdb) disassemble 0xf520bd02,+10
Dump of assembler code from 0xf520bd02 to 0xf520bd0c:
   0xf520bd02:	b	0xf49c9cd6
   0xf520bd06:	movwpl	pc, #18628	; 0x48c4	<UNPREDICTABLE>
   0xf520bd0a:	andlt	r4, r11, r8, lsr #12
End of assembler dump.
(gdb) 

Hmm, we are starting to see weird things here. First, at 0xf520bd02 (which has probably been executed shortly before the crash) we get an unconditional branch to some point in the thread stack (see the mappings in the previous listing). Second, the instruction at 0xf520bd06 (which should be executed after returning from the procedure that provokes the crash) would load into the pc (program counter) an address that is not mapped: we saw in the previous listing that the first mapped address is 0x10000. The movw instruction also has a “pl” suffix, which makes it execute only when the negative flag is clear, that is, when the last flag-setting operation produced a positive or zero result… a strange condition to attach to a move of a constant, 0x48c4, that is encoded in the instruction itself.

I resorted to doing objdump -d libProprietary.so to disassemble the library and compare with gdb output. objdump shows, in that part of the file (subtracting the library load address gives us the offset inside the file: 0xf520bd02-0xf51f6000=0x15d02):

   15d02:	f7f3 eade 	blx	92c0 <__android_log_print@plt>
   15d06:	f8c4 5304 	str.w	r5, [r4, #772]	; 0x304
   15d0a:	4628      	mov	r0, r5
   15d0c:	b00b      	add	sp, #44	; 0x2c
   15d0e:	e8bd 8ff0 	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}

which is completely different from what gdb shows! What is happening here? Taking a look at the addresses in both code chunks, we see that instructions are always 4 bytes long in the gdb output, while they are 2 or 4 bytes long in objdump’s. You have guessed it, haven’t you? We are seeing “normal” ARM instructions in gdb, while objdump is decoding THUMB-2 instructions. Certainly objdump seems to be right here, as its output is more sensible: we have a call to an executable part of the process space at 0x15d02 (it is resolved to a known function, __android_log_print), and the following instructions look like a normal ARM function epilogue: a return value is stored in r0, the sp (stack pointer) is incremented (we are freeing space in the stack), and we restore registers.

If we get back to the register values, we see that cpsr (current program status register [1]) does not have the T bit set, so gdb thinks we are using ARM instructions. We can change this by doing

(gdb) set $cpsr=0x60000030
(gdb) disass 0xf520bd02,+15
Dump of assembler code from 0xf520bd02 to 0xf520bd11:
   0xf520bd02:	blx	0xf51ff2c0
   0xf520bd06:	str.w	r5, [r4, #772]	; 0x304
   0xf520bd0a:	mov	r0, r5
   0xf520bd0c:	add	sp, #44	; 0x2c
   0xf520bd0e:	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, r10, r11, pc}
End of assembler dump.

Ok, much better now [2]. The thumb bit in cpsr is determined by the last bx/blx call: if the target address is odd, the procedure we are calling contains THUMB instructions, otherwise they are ARM (a good reference for these instructions is [3]). In this case, after an exception the CPU moves to arm mode, and gdb is unable to know which is the right mode when disassembling. We can search for hints on which parts of the code are arm/thumb by looking at the values in the registers used by bx/blx, or by looking at the lr (link register): we can see above that its value after the crash was 0xf520bd07, which is odd and indicates that 0xf520bd06 contains a thumb instruction. However, for some reason gdb is not able to take advantage of this information.
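The mode checks described above come down to testing single bits. Here is a minimal Python sketch using the register values from this session. The helper names are made up for illustration; the only architectural facts it relies on are that the T bit is bit 5 of the cpsr and that bit 0 of a bx/blx target or saved return address selects thumb mode.

```python
T_BIT = 1 << 5  # Thumb state bit in the ARM cpsr


def cpsr_is_thumb(cpsr):
    """True if the T bit is set in a cpsr value."""
    return bool(cpsr & T_BIT)


def target_is_thumb(addr):
    """True if a bx/blx target or a saved lr value points at thumb code."""
    return bool(addr & 1)


def instruction_address(addr):
    """Strip the mode bit to get the real address of the instruction."""
    return addr & ~1


print(cpsr_is_thumb(0x60000010))             # cpsr right after the crash
print(cpsr_is_thumb(0x60000030))             # value set by hand in gdb
print(target_is_thumb(0xf520bd07))           # lr right after the crash
print(hex(instruction_address(0xf520bd07)))  # where the thumb code really is
```

The only difference between the two cpsr values is 0x20, which is exactly the T bit.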

Of course this problem does not happen if we have debugging information: in that case we have special symbols that let gdb know whether the section containing the code holds thumb instructions or not [4]. As those are not found, gdb uses the cpsr value. Here objdump seems to have better heuristics though.

After solving this issue with instruction decoding, I started to debug __android_log_print to check what was happening there, as it looked like the crash was happening in that call. I spent quite a lot of time there, but found nothing. All looked fine, and I started to despair. Until I inserted a breakpoint at address 0xf520bd06, right after the call to __android_log_print, ran the program… and it stopped at that address; no crash had happened. I started to execute the program instruction by instruction after that:

(gdb) b *0xf520bd06
(gdb) run
...
Breakpoint 1, 0xf520bd06 in ?? ()
(gdb) si
0xf520bd0a in ?? ()
(gdb) si
0xf520bd0c in ?? ()
(gdb) si
0xf520bd0e in ?? ()
Warning:
Cannot insert breakpoint 0.
Cannot access memory at address 0x0

Something was apparently wrong with the ldmia instruction, which restores registers, including the pc, from the stack. I took a look at the stack at that moment (taking into account that ldmia had already modified the sp after restoring 9 registers == 36 bytes):

(gdb) x/16xw $sp-36
0xf49dde4c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde5c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde6c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde7c:	0x00000000	0x00000000	0x00000000	0x00000000

All zeros! At this point it is clear that this is the real point where the crash is happening, as we are loading 0 into the pc. This looked clearly like a stack corruption issue.

But, before moving forward, why are we getting a wrong backtrace from gdb? Well, gdb is seeing a corrupted stack, so it is not able to unwind it. It would not be able to unwind it even with full debug information. The only hint it has is the lr. This register contains the return address after execution of a bl/blx instruction [3]. If the called procedure is non-leaf, the lr is saved in the prologue and restored in the epilogue, because it gets overwritten when branching to other procedures. In this case, it is restored into the pc and sometimes it is also saved back in the lr, depending on whether we have arm-thumb interworking built into the procedure or not [5]. It is not overwritten if we have a leaf procedure (as there are no procedure calls inside these).

As gdb has no additional information, it uses the lr to build the backtrace, assuming we are in a leaf procedure. However, this is not true and the backtrace turns out to be wrong. Nonetheless, this information was not completely useless: lr was pointing to the instruction right after the last bl/blx instruction that was executed, which was not that far away from the real point where the program was crashing. This happened because fortunately __android_log_print has interworking code and restores the lr; otherwise the value of lr could have come from a point much farther away from the place where the real crash happens. Believe it or not, it could have been even worse!

Having now a clear idea of where and why the crash was happening, things accelerated. The procedure where the crash happened, as disassembled by objdump, was (I include here only the more relevant parts of the code)

00015b1c <ProprietaryProcedure@@Base>:
   15b1c:	e92d 4ff0 	stmdb	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
   15b20:	b08b      	sub	sp, #44	; 0x2c
   15b22:	497c      	ldr	r1, [pc, #496]	; (15d14 <ProprietaryProcedure@@Base+0x1f8>)
   15b24:	2500      	movs	r5, #0
   15b26:	9500      	str	r5, [sp, #0]
   15b28:	4604      	mov	r4, r0
   15b2a:	4479      	add	r1, pc
   15b2c:	462b      	mov	r3, r5
   15b2e:	f8df 81e8 	ldr.w	r8, [pc, #488]	; 15d18 <ProprietaryProcedure@@Base+0x1fc>
   15b32:	462a      	mov	r2, r5
   15b34:	f8df 91e4 	ldr.w	r9, [pc, #484]	; 15d1c <ProprietaryProcedure@@Base+0x200>
   15b38:	ae06      	add	r6, sp, #24
   15b3a:	f8df a1e4 	ldr.w	sl, [pc, #484]	; 15d20 <ProprietaryProcedure@@Base+0x204>
   15b3e:	200f      	movs	r0, #15
   15b40:	f8df b1e0 	ldr.w	fp, [pc, #480]	; 15d24 <ProprietaryProcedure@@Base+0x208>
   15b44:	f7f3 ef76 	blx	9a34 <prctl@plt>
   15b48:	44f8      	add	r8, pc
   15b4a:	4629      	mov	r1, r5
   15b4c:	44f9      	add	r9, pc
   15b4e:	2210      	movs	r2, #16
   15b50:	44fa      	add	sl, pc
   15b52:	4630      	mov	r0, r6
   15b54:	44fb      	add	fp, pc
   15b56:	f7f3 ea40 	blx	8fd8 <memset@plt>
   15b5a:	a807      	add	r0, sp, #28
   15b5c:	f7f3 ef70 	blx	9a40 <sigemptyset@plt>
   15b60:	4b71      	ldr	r3, [pc, #452]	; (15d28 <ProprietaryProcedure@@Base+0x20c>)
   15b62:	462a      	mov	r2, r5
   15b64:	9508      	str	r5, [sp, #32]
   15b66:	4631      	mov	r1, r6
   15b68:	447b      	add	r3, pc
   15b6a:	681b      	ldr	r3, [r3, #0]
   15b6c:	200a      	movs	r0, #10
   15b6e:	9306      	str	r3, [sp, #24]
   15b70:	f7f3 ef6c 	blx	9a4c <sigaction@plt>
   ...
   15d02:	f7f3 eade 	blx	92c0 <__android_log_print@plt>
   15d06:	f8c4 5304 	str.w	r5, [r4, #772]	; 0x304
   15d0a:	4628      	mov	r0, r5
   15d0c:	b00b      	add	sp, #44	; 0x2c
   15d0e:	e8bd 8ff0 	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}

The addresses where this code is loaded can be easily computed by adding 0xf51f6000 to the file offsets shown in the first column. We see that a few calls to different external functions [6] are performed by ProprietaryProcedure, which is itself an exported symbol.
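The conversion between file offsets and runtime addresses used throughout this session is a single addition or subtraction. A small Python sketch, assuming (as the mappings listing shows) that the text segment is mapped from file offset 0x0 at 0xf51f6000; the function names are just illustrative:

```python
LOAD_BASE = 0xf51f6000  # start of the first libProprietary.so mapping


def runtime_address(file_offset, base=LOAD_BASE):
    """Address at which a given objdump offset ends up in the process."""
    return base + file_offset


def file_offset(address, base=LOAD_BASE):
    """objdump offset that corresponds to an address seen in gdb."""
    return address - base


# Offsets taken from the objdump listing of ProprietaryProcedure
for off in (0x15b20, 0x15b5c, 0x15b60, 0x15d06):
    print(hex(off), "->", hex(runtime_address(off)))

# The lr value after the crash, with the thumb bit cleared
print(hex(file_offset(0xf520bd06)))
```

These are the addresses used for the breakpoints in the rest of the session.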

I restarted the debug session, added a breakpoint at the start of ProprietaryProcedure, right after stmdb saves the state, and checked the stack values:

(gdb) b *0xf520bb20
Breakpoint 1 at 0xf520bb20
(gdb) cont
...
Breakpoint 1, 0xf520bb20 in ?? ()
(gdb) p $sp
$1 = (void *) 0xf49dde4c
(gdb) x/16xw $sp
0xf49dde4c:	0xf49de460	0x0007df00	0x00000000	0xf49dde70
0xf49dde5c:	0xf49de694	0x00000000	0xf77e9000	0x00000000
0xf49dde6c:	0xf75b4491	0x00000000	0xf49de460	0x00000000
0xf49dde7c:	0x00000000	0xfd5b4eba	0xfe9dd4a3	0xf49de460

We can see that the stack contains something, including a return address that looks valid (0xf75b4491). Note also that the procedure must never touch this part of the stack, as it belongs to the caller of ProprietaryProcedure.

Now it is simply a matter of bisecting the code between the beginning and the end of ProprietaryProcedure to find out where the stack gets clobbered. I will spare you the details of this tedious process. Instead, I will just show that, in the end, the call to sigemptyset() turned out to be the culprit [7]:

(gdb) b *0xf520bb5c
Breakpoint 1 at 0xf520bb5c
(gdb) b *0xf520bb60
Breakpoint 2 at 0xf520bb60
(gdb) run
Breakpoint 1, 0xf520bb5c in ?? ()
(gdb) x/16xw 0xf49dde4c
0xf49dde4c:	0xf49de460	0x0007df00	0x00000000	0xf49dde70
0xf49dde5c:	0xf49de694	0x00000000	0xf77e9000	0x00000000
0xf49dde6c:	0xf75b4491	0x00000000	0xf49de460	0x00000000
0xf49dde7c:	0x00000000	0xfd5b4eba	0xfe9dd4a3	0xf49de460
(gdb) cont
Continuing.
Breakpoint 2, 0xf520bb60 in ?? ()
(gdb) x/16xw 0xf49dde4c
0xf49dde4c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde5c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde6c:	0x00000000	0x00000000	0x00000000	0x00000000
0xf49dde7c:	0x00000000	0x00000000	0x00000000	0x00000000

Note here that I am printing the part of the stack not reserved by the function (0xf49dde4c is the value of the sp before execution of the line at offset 0x15b20, see the code).

What is going wrong here? Remember that at the beginning of the article I mentioned that we were using libhybris. libProprietary assumes a bionic environment, and the libc functions it calls are from bionic’s libc. However, libhybris has hooks for some bionic functions: for them bionic is not called, and the hook is invoked instead. libhybris does this to avoid conflicts between bionic and glibc: for instance, having two allocators fighting for the process address space is a recipe for disaster, so malloc() and related functions are hooked and the hooks end up calling the glibc implementation. Signal-related functions were hooked too, including sigemptyset(), and in this case the hook simply called the glibc implementation.

I looked at the glibc and bionic implementations: in both cases sigemptyset() is a very simple utility function that clears a sigset_t variable with memset(). Everything pointed to different definitions of sigset_t depending on the library. The definition turned out to be a bit messy to follow in the code, as it depended on build-time definitions, so I resorted to gdb to print the type. For an executable compiled for glibc, I saw

(gdb) ptype sigset_t
type = struct {
    unsigned long __val[32];
}

and for one using bionic

(gdb) ptype sigset_t
type = unsigned long

This finally confirms where the bug is, and explains it: we are overwriting the stack because libProprietary reserves stack memory for bionic’s sigset_t, while we are using glibc’s sigemptyset(), which uses a different definition for it. As glibc’s definition is much bigger (32 words instead of one), the call to memset() writes well past the reserved space and overwrites the stack. We get the crash later, when the function returns and tries to restore registers.
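The arithmetic behind the overwrite can be checked with the numbers from the session. The following Python sketch assumes a 4-byte unsigned long (32-bit ARM) and that sigemptyset() clears exactly sizeof(sigset_t) bytes starting at the pointer it receives; the stack layout comes from the prologue of ProprietaryProcedure (stmdb of nine registers, sub sp, #44, and add r0, sp, #28 before the call).

```python
WORD = 4                        # assumed size of unsigned long on 32-bit ARM
SP_AFTER_PUSH = 0xf49dde4c      # sp right after stmdb, as printed by gdb
FRAME_SIZE = 44                 # sub sp, #44
SIGSET_OFFSET = 28              # add r0, sp, #28 (argument to sigemptyset)
SAVED_REGS = ["r4", "r5", "r6", "r7", "r8", "r9", "sl", "fp", "lr"]

BIONIC_SIGSET_SIZE = 1 * WORD   # type = unsigned long
GLIBC_SIGSET_SIZE = 32 * WORD   # unsigned long __val[32]

sigset_addr = SP_AFTER_PUSH - FRAME_SIZE + SIGSET_OFFSET


def clobbered(size):
    """Saved registers whose stack slot falls inside the cleared range."""
    end = sigset_addr + size
    return [name for i, name in enumerate(SAVED_REGS)
            if sigset_addr <= SP_AFTER_PUSH + i * WORD < end]


print(hex(sigset_addr), hex(sigset_addr + GLIBC_SIGSET_SIZE))
print("bionic:", clobbered(BIONIC_SIGSET_SIZE))
print("glibc: ", clobbered(GLIBC_SIGSET_SIZE))
```

With bionic’s size nothing outside the local variable is touched, while with glibc’s size every saved register, including the return address, lies inside the zeroed range.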

After knowing this, the solution was simple: I removed the libhybris hooks for signal functions, recompiled it, and… all worked just fine, no crashes anymore!

However, this is not the final solution: as signals are shared resources, it makes sense to hook them in libhybris. But, to do it properly, the hooks have to translate types between bionic and glibc, something that we were not doing (we were simply calling the glibc implementation). That, however, is “just work”.

Of course I wondered why the heck a library that is kind of generic needs to mess around with signals, but hey, that is not my fault ;-).

Conclusions

I can say I learned several things while debugging this:

  1. Not having the sources is terrible for debugging (well, I already knew this). Unfortunately not open sourcing the code is still a standard practice in part of the industry.
  2. The most interesting technical bit here is IMHO that we need to be very cautious with the backtrace that debuggers show after a crash. If you start to see things that do not make sense, it is possible that registers or stack have been messed up and the real crash happens elsewhere. Bear in mind that the very first thing to do when a program crashes is to make sure that we know the exact point where that happens.
  3. We have to be careful in ARM when disassembling, because if there is no debug information we could be seeing the wrong instruction set. We can check the parity of the addresses used by bx/blx and of the lr to make sure we are in the right mode.
  4. Sometimes taking a look at assembly code can help us when debugging, even when we have the sources. Note that if I had had the C sources I would have seen the crash happening right when returning from a function, and it might not have been that immediate to find out that the stack was messed up. The assembly clearly pointed to an overwritten stack.
  5. Finally, I personally learned some bits of ARM architecture that I did not know, which was great.

Well, this is it. I hope you enjoyed the (lengthy, I know) article. Thanks for reading!

[1] http://www.heyrick.co.uk/armwiki/The_Status_register
[2] We can get the same result by executing in gdb set arm fallback-mode thumb, but changing the register seemed more pedagogical here.
[3] http://infocenter.arm.com/help/topic/com.arm.doc.dui0068b/DUI0068.pdf
[4] http://reverseengineering.stackexchange.com/questions/6080/how-to-detect-thumb-mode-in-arm-disassembly
[5] http://www.mcternan.me.uk/ArmStackUnwinding/
[6] In fact the calls are to the PLT section, which is inside the library. The PLT calls in turn, by using addresses in the GOT data section, either directly the function or the dynamic loader, as we are doing lazy loading. See https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html, for instance.
[7] I had to use two breakpoints between consecutive instructions because the “ni” gdb command was not working well here.

pitti

We currently use completely different methods and tools of building test beds and running tests for Debian vs. Click packages, for normal uploads vs. CI airline landings vs. upstream project merge proposal testing, and keep lots of knowledge about Click package test metadata external and not easily accessible/discoverable.

Today I released autopkgtest 3.0 (and 3.0.1 with a few minor updates) which is a major milestone in unifying how we run package tests both locally and in production CI. The goals of this are:

  • Keep all test metadata, such as test dependencies, commands to run the test etc., in the project/package source itself instead of external. We have had that for a long time for Debian packages with DEP-8 and debian/tests/control, but not yet for Ubuntu’s Click packages.
  • Use the same tools for Debian and Click packages to simplify what developers have to know about and to reduce the amount of test infrastructure code to maintain.
  • Use the exact same testbeds and test runners in production CI as developers use locally, so that you can reproduce and investigate failures.
  • Re-use the existing autopkgtest capabilities for using various kinds of testbeds, and conversely, making all new testbed types immediately available to all package formats.
  • Stop putting tests into the Ubuntu archive as packages (such as mediaplayer-app-autopilot). This just adds packaging and archive space overhead and also makes updating tests a lot harder and slower than it should be.

So, let’s dive into the new features!

New runner: adt-virt-ssh

We want to run tests on real hardware such as a laptop of a particular brand with a particular graphics card, or an Ubuntu phone. We also want to restructure our current CI machinery to run tests on a real OpenStack cloud and gradually get rid of our hand-maintained QA lab with its test machines. While these use cases seem rather different, they both have in common that there is an already existing machine which is pretty much only accessible with ssh. Once you have an ssh connection, they look pretty much the same, you just need different initial setup (like fiddling with adb, calling nova boot, etc.) to prepare them.

So the new adt-virt-ssh runner factorizes all the common bits such as communicating with adt-run, auto-detecting sudo availability, doing SSH connection sharing etc., and delegates the target specific bits to a “setup script”. E. g. we could specify --setup-script ssh-setup-nova or --setup-script ssh-setup-adb, which would then get called with open at the appropriate time by adt-run; it calls the nova commands to create a VM, or runs a few adb commands to install/start ssh and install the public key. Then autopkgtest does its thing, and eventually calls the script again with cleanup. The actual protocol is a bit more involved (see manpage), but that’s the general idea.

autopkgtest now ships readymade scripts for these two use cases. So you could e. g. run the libpng tests in a temporary cloud VM:

# if you don't have one, create it with "nova keypair-create"
$ nova keypair-list
[...]
| pitti | 9f:31:cf:78:50:4f:42:04:7a:87:d7:2a:75:5e:46:56 |

# find a suitable image
$ nova image-list 
[...]
| ca2e362c-62c9-4c0d-82a6-5d6a37fcb251 | Ubuntu Server 14.04 LTS (amd64 20140607.1) - Partner Image                         | ACTIVE |  

$ nova flavor-list 
[...]
| 100 | standard.xsmall  | 1024      | 10   | 10        |      | 1     | 1.0         | N/A       |

# now run the tests: please be patient, this takes a few mins!
$ adt-run libpng --setup-commands="apt-get update" --- ssh -s /usr/share/autopkgtest/ssh-setup/nova -- \
   -f standard.xsmall -i ca2e362c-62c9-4c0d-82a6-5d6a37fcb251 -k pitti
[...]
adt-run [16:23:16]: test build:  - - - - - - - - - - results - - - - - - - - - -
build                PASS
adt-run: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ tests done.

Please see man adt-virt-ssh for details on how to use it and how to write setup scripts. There is also a commented /usr/share/autopkgtest/ssh-setup/SKELETON template for writing your own for your use cases. You can also skip the setup script and just specify user and host name as options, but please remember that the ssh runner cannot clean up after itself, so never use this on important machines which you can’t reset/reinstall!

Test dependency installation without apt/root

Ubuntu phones with system images have a read-only file system where you can’t install test dependencies with apt. A similar case is using the “null” runner without root. When apt-get install is not available, autopkgtest now has a reduced fallback mode: it downloads the required test dependencies, unpacks them into a temporary directory, and runs the tests with $PATH, $PYTHONPATH, $GI_TYPELIB_PATH, etc. pointing to the unpacked temp dir. Of course this only works for packages which are relocatable in that way, i. e. libraries, Python modules, or command line tools; it will totally fail for things which look for config files, plugins etc. in hardcoded directory paths. But it’s good enough for the purposes of Click package testing such as installing autopilot, libautopilot-qt etc.
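To make the idea of the fallback more concrete, here is a rough Python sketch of how an environment pointing at an unpacked dependency tree could be built. This is not autopkgtest’s actual code: the function name, the /tmp/adt-deps directory and the subdirectories listed per variable are assumptions made purely for illustration.

```python
import os

# Hypothetical layout of the unpacked dependencies; the directories that
# autopkgtest really adds may differ.
SEARCH_PATHS = {
    "PATH": ["usr/bin", "bin"],
    "PYTHONPATH": ["usr/lib/python3/dist-packages"],
    "GI_TYPELIB_PATH": ["usr/lib/girepository-1.0"],
}


def fallback_env(unpack_dir, base_env=None):
    """Return a copy of the environment with the unpacked dirs prepended."""
    env = dict(os.environ if base_env is None else base_env)
    for var, subdirs in SEARCH_PATHS.items():
        extra = [os.path.join(unpack_dir, d) for d in subdirs]
        old = env.get(var)
        env[var] = os.pathsep.join(extra + ([old] if old else []))
    return env


env = fallback_env("/tmp/adt-deps", {"PATH": "/usr/bin:/bin"})
for var in SEARCH_PATHS:
    print(var, "=", env[var])
```

Anything that looks up its files through these variables will find the unpacked copies first, which is also why software with hardcoded paths cannot benefit from this trick.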

Click package support

autopkgtest now recognizes click source directories and *.click package arguments, and introduces a new test metadata specification syntax in a click package manifest. This is similar in spirit and capabilities to DEP-8 debian/tests/control, except that it’s using JSON:

    "x-test": {
        "unit": "tests/unittests",
        "smoke": {
            "path": "tests/smoketest",
            "depends": ["shunit2", "moreutils"],
            "restrictions": ["allow-stderr"]
        },
        "another": {
            "command": "echo hello > /tmp/world.txt"
        }
    }

For convenience, there is also some magic to make running autopilot tests particularly simple. E. g. our existing click packages usually specify something like

    "x-test": {
        "autopilot": "ubuntu_calculator_app"
    }

which is enough to “do what I mean”, i. e. implicitly add the autopilot test depends and run autopilot with the specified test module name. You can specify your own dependencies and/or commands, and restrictions etc., of course.
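As an illustration of the different shapes an “x-test” entry can take, the following Python sketch reads a manifest and normalizes each test into a dictionary. It is not how autopkgtest itself parses manifests: the manifest below merges the two examples above into one, and treating a bare string as a test path as well as the autopilot_module key are assumptions of this sketch.

```python
import json

MANIFEST = """
{
    "x-test": {
        "unit": "tests/unittests",
        "smoke": {
            "path": "tests/smoketest",
            "depends": ["shunit2", "moreutils"],
            "restrictions": ["allow-stderr"]
        },
        "another": {
            "command": "echo hello > /tmp/world.txt"
        },
        "autopilot": "ubuntu_calculator_app"
    }
}
"""


def normalize_tests(manifest):
    """Turn every x-test entry into a dict with explicit keys."""
    tests = {}
    for name, spec in manifest.get("x-test", {}).items():
        if isinstance(spec, str):
            if name == "autopilot":
                # hypothetical key for the autopilot shortcut
                spec = {"autopilot_module": spec}
            else:
                # assumption: a bare string is the path of the test
                spec = {"path": spec}
        entry = {"depends": [], "restrictions": []}
        entry.update(spec)
        tests[name] = entry
    return tests


for name, spec in normalize_tests(json.loads(MANIFEST)).items():
    print(name, spec)
```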

So with this, and the previous support for non-apt test dependencies and the ssh runner, we can put all this together to run the tests for e. g. the Ubuntu calculator app on the phone:

$ bzr branch lp:ubuntu-calculator-app
# built straight from that branch; TODO: where is the "official" download URL?
$ wget http://people.canonical.com/~pitti/tmp/com.ubuntu.calculator_1.3.283_all.click
$ adt-run ubuntu-calculator-app/ com.ubuntu.calculator_1.3.283_all.click --- \
      ssh -s /usr/share/autopkgtest/ssh-setup/adb
[..]
Traceback (most recent call last):
  File "/tmp/adt-run.KfY5bG/tree/tests/autopilot/ubuntu_calculator_app/tests/test_simple_page.py", line 93, in test_divide_with_infinity_length_result_number
    self._assert_result("0.33333333")
  File "/tmp/adt-run.KfY5bG/tree/tests/autopilot/ubuntu_calculator_app/tests/test_simple_page.py", line 63, in _assert_result
    self.main_view.get_result, Eventually(Equals(expected_result)))
  File "/usr/lib/python3/dist-packages/testtools/testcase.py", line 406, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: After 10.0 seconds test failed: '0.33333333' != '0.3'

Ran 33 tests in 295.586s
FAILED (failures=1)

Note that the current adb ssh setup script deals with some things like applying the autopilot click AppArmor hooks and disabling screen dimming, but it does not do the first-time setup (connecting to network, doing the gesture intro) and unlocking the screen. These are still on the TODO list, but I need to find out how to do these properly. Help appreciated!

Click app tests in schroot/containers

But, that’s not the only thing you can do! autopkgtest has all these other runners, so why not try and run them in a schroot or container? To emulate the environment of an Ubuntu Touch session I wrote a --setup-commands script:

adt-run --setup-commands /usr/share/autopkgtest/setup-commands/ubuntu-touch-session \
    ubuntu-calculator-app/ com.ubuntu.calculator_1.3.283_all.click --- schroot utopic

This will actually work in the sense that the autopilot tests run and succeed, but the overall run will fail due to a lot of libust[11345/11358]: Error: Error opening shm /lttng-ust-wait... warnings on stderr. I don’t know what these mean, just that I also see them on the phone itself occasionally.

I also wrote another setup-commands script which emulates “read-only apt”, so that you can test the “unpack only” fallback. So you could prepare a container with click and the App framework preinstalled (so that it doesn’t always take ages to install them), starting from a standard adt-build-lxc container:

$ sudo lxc-clone -o adt-utopic -n click
$ sudo lxc-start -n click
  # run "sudo apt-get install click ubuntu-sdk-libs ubuntu-app-launch-tools" there
  # then "sudo powerdown"

# current apparmor profile doesn't allow remounting something read-only
$ echo "lxc.aa_profile = unconfined" | sudo tee -a /var/lib/lxc/click/config

Now that container has enough stuff preinstalled to be reasonably fast to set up, and the remaining test dependencies (mostly autopilot) work fine with the unpack/$*_PATH fallback:

$ adt-run --setup-commands /usr/share/autopkgtest/setup-commands/ubuntu-touch-session \
          --setup-commands /usr/share/autopkgtest/setup-commands/ro-apt \
          ubuntu-calculator-app/ com.ubuntu.calculator_1.3.283_all.click \
          --- lxc -es click

This will successfully run all the tests, and provided you have apt-cacher-ng installed, it only takes a few seconds to set up. This might be a nice thing to do on merge proposals, if you don’t have an actual phone at hand, or don’t want to clutter it up.

autopkgtest 3.0.1 will be available in Utopic tomorrow (through autosyncs). If you can’t wait to try it out, download it from my people.c.c page ☺.

Feedback appreciated!

Stéphane Graber

Ubuntu Touch images

For those not yet familiar with this, Ubuntu Touch systems are set up using a read-only root filesystem on top of which writable paths are mounted using bind-mounts from persistent or ephemeral storage.

The default update mechanism is therefore image based. We build new images on our build infrastructure, generate diffs between images and publish the result on the public server.

Each image is made of a bunch of xz-compressed tarballs; the actual number of tarballs may vary, and so may their names. In the end, the upgrader simply mounts the partitions and unpacks the tarballs in the order they are given. It has a list of files to remove, and the rest of the files are simply unpacked on top of the existing system.
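The core of what the upgrader does can be sketched in a few lines of Python. This is only an illustration, not the real upgrader: the function name and the way the removal list is passed in are hypothetical, removing files before unpacking is an assumption, and nothing here mounts partitions or verifies signatures.

```python
import os
import tarfile


def apply_image(tarballs, removed, root):
    """Delete the listed files, then unpack each tarball over root in order."""
    for rel in removed:
        target = os.path.join(root, rel.lstrip("/"))
        if os.path.isfile(target) or os.path.islink(target):
            os.remove(target)
    for path in tarballs:
        # "r:*" lets tarfile detect the compression, xz included
        with tarfile.open(path, "r:*") as tar:
            tar.extractall(root)
```

Because tarballs are unpacked in the order given, a file present in several of them ends up with the contents of the last one, which is what lets one tarball override files shipped by another.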

Delta images only contain the files that are different from the previous image; full images contain them all. Partition images are stored in binary format in a partitions/ directory which the upgrader checks and flashes automatically.

The current list of tarballs we tend to use for the official images is:

  • ubuntu: Ubuntu root filesystem (common to all devices)
  • device: Device specific data (partition images and Android image)
  • custom: Customization tarball (applied on top of the root filesystem in /custom)
  • version: Channel/device/build metadata

For more details on how this all works, I’d recommend reading our wiki pages which act as the go-to specification for the server, clients and upgrader.

Running a server

There are a lot of reasons why you may want to run your own system-image server but the main ones seem to be:

  • Supporting your own Ubuntu Touch port with over-the-air updates
  • Publishing your own customized version of an official image
  • QA infrastructure for Ubuntu Touch images
  • Using it as an internal buffer/mirror for your devices

Up until now, doing this was pretty tricky as there wasn’t an easy way to import files from the public system-image server into a local one, nor was there a simple way to replace the official GPG keys with your own (which would result in your updates being considered invalid).

This was finally resolved on Friday when I landed the code for a few new file generators in the main system-image server branch.

It’s now reasonably easy to set up your own server, have it mirror some bits from the main public server, swap GPG keys and include your local tarballs.

Before I start with step-by-step instructions, please note that due to bug 1278589, you need a valid SSL certificate (https) on your server. This may be a problem for some porters who don’t have a separate IP for their server or can’t afford an SSL certificate. We plan on having this resolved in the system-image client soon.

Installing your server

Those instructions have been tried on a clean Ubuntu 13.10 cloud instance. They assume that you are running them as an “ubuntu” user with “/home/ubuntu” as its home directory.

Install some required packages:

sudo apt-get install -y bzr abootimg android-tools-fsutils \
    python-gnupg fakeroot pxz pep8 pyflakes python-mock apache2

You’ll need a fair amount of available entropy to generate all the keys used by the test suite and production server. If you are doing this for testing only and don’t care much about getting strong keys, you may want to install “haveged” too.

Then set up the web server:

sudo adduser $USER www-data
sudo chgrp www-data /var/www/
sudo chmod g+rwX /var/www/
sudo rm -f /var/www/index.html
newgrp www-data

That being done, now let’s grab the server code, generate some keys and run the testsuite:

bzr branch lp:~ubuntu-system-image/ubuntu-system-image/server system-image
cd system-image
tests/generate-keys
tests/run
cp -R tests/keys/*/ secret/gpg/keys/
bin/generate-keyrings

Now all you need is some configuration. We’ll define a single “test” channel which will contain a single device “mako” (nexus4). It’ll mirror both the ubuntu and device tarballs from the main public server (using the trusty-proposed channel over there), repack the device tarball to swap the GPG keys, then download a customization tarball from an http server, stack a keyring tarball (overriding the keys in the ubuntu tarball) and finally generate a version tarball. This channel will contain up to 15 images and will start at image ID “1”.

All of this can be done with the following bit of configuration in etc/config (you’ll need to change your server’s FQDN accordingly):

[global]
base_path = /home/ubuntu/system-image/
channels = test
gpg_key_path = secret/gpg/keys/
gpg_keyring_path = secret/gpg/keyrings/
publish_path = /var/www/
state_path = state/
public_fqdn = system-image.test.com
public_http_port = 80
public_https_port = 443

[channel_test]
type = auto
versionbase = 1
fullcount = 15
files = ubuntu, device, custom-savilerow, keyring, version
file_ubuntu = remote-system-image;https://system-image.ubuntu.com;trusty-proposed;ubuntu
file_device = remote-system-image;https://system-image.ubuntu.com;trusty-proposed;device;keyring=archive-master
file_custom-savilerow = http;https://jenkins.qa.ubuntu.com/job/savilerow-trusty/lastSuccessfulBuild/artifact/build/custom.tar.xz;name=custom-savilerow,monitor=https://jenkins.qa.ubuntu.com/job/savilerow-trusty/lastSuccessfulBuild/artifact/build/build_number
file_keyring = keyring;archive-master
file_version = version
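
To see how the “files” list ties each name to its file_<name> generator line, here is a small Python sketch that reads a shortened copy of the configuration above with the standard configparser module. This is only an illustration: the helper name channel_generators is made up, and the real server may parse its configuration differently.

```python
import configparser

# Shortened copy of the configuration above (custom tarball left out).
SAMPLE = """
[global]
channels = test

[channel_test]
type = auto
versionbase = 1
fullcount = 15
files = ubuntu, device, keyring, version
file_ubuntu = remote-system-image;https://system-image.ubuntu.com;trusty-proposed;ubuntu
file_device = remote-system-image;https://system-image.ubuntu.com;trusty-proposed;device;keyring=archive-master
file_keyring = keyring;archive-master
file_version = version
"""


def channel_generators(text, channel):
    """Return a list of (file name, generator, arguments) in stacking order."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["channel_" + channel]
    result = []
    for name in (item.strip() for item in section["files"].split(",")):
        # The first ';'-separated field names the generator, the rest are
        # its arguments (an assumption based on the lines above).
        generator, *arguments = section["file_" + name].split(";")
        result.append((name, generator, arguments))
    return result


for entry in channel_generators(SAMPLE, "test"):
    print(entry)
```

The order of the “files” entry is preserved, which matters because the tarballs are stacked in that order.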

Lastly, we need to actually create the channel and device in the server. This is done by calling “bin/si-shell” and then running:

pub.create_channel("test")
pub.create_device("test", "mako")
for keyring in ("archive-master", "image-master", "image-signing", "blacklist"):
    pub.publish_keyring(keyring)

And that’s it! Your server is now ready to use.
To generate your first image, simply run “bin/import-images”.
This will take a while as it’ll need to download files from those external servers and repack some bits, but once it’s done, you’ll have a new image published.

You’ll probably want to run that command from cron every few minutes so that whenever any of the referenced files change a new image is generated and published (deltas will also be automatically generated).
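
Since an import can take a while, a cron job firing every few minutes may overlap with a run that is still in progress. One way to guard against that is a small wrapper that takes a non-blocking file lock first. The sketch below assumes Linux, and both the wrapper name run_exclusive and the lock path are hypothetical.

```python
import fcntl
import subprocess


def run_exclusive(command, lock_path):
    """Run `command` unless another run holds the lock.

    Returns the command's exit code, or None if the lock was already taken.
    """
    with open(lock_path, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # a previous run is still in progress
        # The lock is released when the file is closed on leaving the block.
        return subprocess.call(command)
```

From the system-image directory, cron could then call something like run_exclusive(["bin/import-images"], "/tmp/import-images.lock").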

To look at the result of the above, I have set up a server here: https://phablet.stgraber.org

To use that server, you’d flash using: phablet-flash ubuntu-system --alternate-server phablet.stgraber.org --channel test

Stéphane Graber

This is post 4 out of 10 in the LXC 1.0 blog post series.

Running foreign architectures

By default LXC will only let you run containers of one of the architectures supported by the host. That makes sense since after all, your CPU doesn’t know what to do with anything else.

Except that we have this convenient package called “qemu-user-static” which contains a whole bunch of emulators for quite a few interesting architectures. The most common and useful of those is qemu-arm-static which will let you run most armv7 binaries directly on x86.

The “ubuntu” template knows how to make use of qemu-user-static, so you can simply check that you have the “qemu-user-static” package installed, then run:

sudo lxc-create -t ubuntu -n p3 -- -a armhf

After a rather long bootstrap, you’ll get a new p3 container which will be mostly running Ubuntu armhf. I’m saying mostly because the qemu emulation comes with a few limitations, the biggest of which is that any piece of software using the ptrace() syscall will fail and so will anything using netlink. As a result, LXC will install the host architecture version of upstart and a few of the networking tools so that the containers can boot properly.

stgraber@castiana:~$ file /bin/ls
/bin/ls: ELF 64-bit LSB  executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=e50e0a5dadb8a7f4eaa2fd715cacb9842e157dc7, stripped
stgraber@castiana:~$ sudo lxc-start -n p3 -d
stgraber@castiana:~$ sudo lxc-attach -n p3
root@p3:/# file /bin/ls
/bin/ls: ELF 32-bit LSB  executable, ARM, EABI5 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=88ff013a8fd9389747fb1fea1c898547fb0f650a, stripped
root@p3:/# exit
stgraber@castiana:~$ sudo lxc-stop -n p3
stgraber@castiana:~$
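
The architecture that “file” reports comes straight from the ELF header. As a rough sketch of the same check in Python, the function below reads the class byte and the e_machine field; the name elf_summary is made up for this example and the machine table only covers a handful of values.

```python
import struct

# A few e_machine values from the ELF specification.
MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def elf_summary(header):
    """Describe an ELF header (at least its first 20 bytes) as 'NN-bit MACHINE'."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}.get(header[4])
    if bits is None:
        raise ValueError("unknown ELF class")
    # Byte 5 gives the byte order: 1 is little-endian, 2 is big-endian.
    endian = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return "%d-bit %s" % (bits, MACHINES.get(machine, "machine %d" % machine))
```

Passing it the first 20 bytes of a binary, for instance open("/bin/ls", "rb").read(20), should tell apart the host and container binaries shown above.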

Hooks

As we know people like to script their containers and that our configuration can’t always accommodate every single use case, we’ve introduced a set of hooks which you may use.

Those hooks are simple paths to an executable file which LXC will run at some specific time in the lifetime of the container. Those executables will also be passed a set of useful environment variables so they can easily know what container invoked them and what to do.

The currently available hooks are (details in lxc.conf(5)):

  • lxc.hook.pre-start (called before any initialization is done)
  • lxc.hook.pre-mount (called after creating the mount namespace but before mounting anything)
  • lxc.hook.mount (called after the mounts but before pivot_root)
  • lxc.hook.autodev (identical to mount but only called if using autodev)
  • lxc.hook.start (called in the container right before /sbin/init)
  • lxc.hook.post-stop (run after the container has been shutdown)
  • lxc.hook.clone (called when cloning a container into a new one)

Additionally each network section may also define two additional hooks:

  • lxc.network.script.up (called in the network namespace after the interface was created)
  • lxc.network.script.down (called in the network namespace before destroying the interface)

All of those hooks may be specified as many times as you want in the configuration so you can use each hooking point multiple times.

As a simple example, let’s add the following to our “p1” container:

lxc.hook.pre-start = /var/lib/lxc/p1/pre-start.sh

And create the hook itself at /var/lib/lxc/p1/pre-start.sh:

#!/bin/sh
echo "arguments: $*" > /tmp/test
echo "environment:" >> /tmp/test
env | grep LXC >> /tmp/test

Make it executable (chmod 755) and then start the container.
Checking /tmp/test you should see:

arguments: p1 lxc pre-start
environment:
LXC_ROOTFS_MOUNT=/usr/lib/x86_64-linux-gnu/lxc
LXC_CONFIG_FILE=/var/lib/lxc/p1/config
LXC_ROOTFS_PATH=/var/lib/lxc/p1/rootfs
LXC_NAME=p1
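
Hooks don’t have to be shell scripts; any executable works. As a sketch, the same report could be built in Python. The function name hook_report is made up for this example, it filters on variable names rather than on whole lines as grep does, and it sorts the variables, so the order may differ from the shell version.

```python
def hook_report(arguments, environ):
    """Build the same kind of report as the shell hook above.

    `arguments` are the hook's command line arguments and `environ`
    is a mapping of environment variables.
    """
    lines = ["arguments: " + " ".join(arguments), "environment:"]
    lines += ["%s=%s" % (key, value)
              for key, value in sorted(environ.items()) if "LXC" in key]
    return "\n".join(lines) + "\n"
```

An actual hook script would call hook_report(sys.argv[1:], os.environ) and write the result to a file of its choosing.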

Android containers

I’ve often been asked whether it was possible to run Android in an LXC container. Well, the short answer is yes. However it’s not very simple and it really depends on what you want to do with it.

The first thing you’ll need if you want to do this is to get your machine to run an Android kernel. You’ll also need to have any modules needed by Android built and loaded before you can start the container.

Once you have that, you’ll need to create a new container by hand.
Let’s put it in “/var/lib/lxc/android/”. In there, you need a configuration file similar to this one:

lxc.rootfs = /var/lib/lxc/android/rootfs
lxc.utsname = armhf

lxc.network.type = none

lxc.devttydir = lxc
lxc.tty = 4
lxc.pts = 1024
lxc.arch = armhf
lxc.cap.drop = mac_admin mac_override
lxc.pivotdir = lxc_putold

lxc.hook.pre-start = /var/lib/lxc/android/pre-start.sh

lxc.aa_profile = unconfined

/var/lib/lxc/android/pre-start.sh is where the interesting bits happen. It needs to be an executable shell script, containing something along the lines of:

#!/bin/sh
mkdir -p "$LXC_ROOTFS_PATH"
mount -n -t tmpfs tmpfs "$LXC_ROOTFS_PATH"

# Unpack the initrd into the tmpfs; stop if we can't enter it
cd "$LXC_ROOTFS_PATH" || exit 1
gzip -dc /var/lib/lxc/android/initrd.gz | cpio -i

# Create /dev/pts if missing
mkdir -p "$LXC_ROOTFS_PATH/dev/pts"

Then get the initrd for your device and place it in /var/lib/lxc/android/initrd.gz.

At that point, when starting the LXC container, the Android initrd will be unpacked on a tmpfs (similar to Android’s ramfs) and Android’s init will be started which in turn should mount any partition that Android requires and then start all of the usual services.

Because there is no apparmor, cgroup or even network configuration applied to it, the container will have a lot of rights and will typically completely crash the machine. You unfortunately have to be familiar with the way Android works and not be afraid to modify its init scripts, or even its init process, to only start the bits you actually want.

I can’t provide a generic recipe there as it completely depends on what you’re interested in, what version of Android and what device you’re using. But it’s clearly possible to do and you may want to look at Ubuntu Touch to see how we’re doing it by default there.

One last note, Android’s init script isn’t in /sbin/init, so you need to tell LXC where to load it with:

lxc-start -n android -- /init

LXC on Android devices

So now that we’ve seen how to run Android in LXC, let’s talk about running Ubuntu on Android in LXC.

LXC has been ported to bionic (Android’s C library) and while not feature-equivalent with its glibc build, it’s still good enough to be used.

Unfortunately, due to the kind of low-level access LXC requires and the fact that our primary focus isn’t Android, installation could be easier… You won’t be finding LXC on the Google Play Store and we won’t provide you with a .apk that you can install.

Instead every time something changes in the upstream git branch, we produce a new tarball which can be downloaded here: https://jenkins.linuxcontainers.org/view/LXC/view/LXC%20builds/job/lxc-build-android/lastSuccessfulBuild/artifact/lxc-android.tar.gz

This build is known to work with Android >= 4.2 but will quite likely work on older versions too.

For this to work, you’ll need to grab your device’s kernel configuration and run lxc-checkconfig against it to see whether it’s compatible with LXC or not. Unfortunately it’s very likely that it won’t be… In that case, you’ll need to go hunt for the kernel source for your device, add the missing feature flags, rebuild it and update your device to boot your updated kernel.
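
To get a feel for what such a check involves, here is a much simplified Python sketch that scans a kernel configuration for a few of the namespace and cgroup options that containers rely on. It is not a replacement for lxc-checkconfig: the option list is deliberately short and the function name missing_options is made up.

```python
# A small subset of the options containers depend on.
REQUIRED = ("CONFIG_NAMESPACES", "CONFIG_UTS_NS", "CONFIG_IPC_NS",
            "CONFIG_PID_NS", "CONFIG_NET_NS", "CONFIG_CGROUPS")


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are neither built in (y) nor modules (m)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_FOO is not set" comments.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [name for name in required if name not in enabled]
```

Any option it reports as missing is a feature flag you would have to enable before rebuilding the kernel.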

As scary as this may sound, it’s usually not that difficult as long as your device is unlocked and you’re already using an alternate ROM like Cyanogen, which usually makes its kernel git tree easily available.

Once your device has a working kernel, all you need to do is unpack our tarball as root in your device’s / directory, copy an arm container to /data/lxc/containers/<container name>, get into /data/lxc and run “./run-lxc lxc-start -n <container name>”.
A few seconds later you’ll be greeted by a login prompt.

Stéphane Graber

After over 3 months of development and experimentation, I’m glad to announce that the system images are now the recommended way to deploy and update the 4 supported Ubuntu Touch devices, maguro (Galaxy Nexus), mako (Nexus 4), grouper (Nexus 7) and manta (Nexus 10).

Anyone using one of those devices can choose to switch to the new images using: phablet-flash ubuntu-system

Once that’s done, further updates will be pushed over the air and can be applied through the Updates panel in the System Settings.

Ubuntu Touch Upgrader

You should be getting a new update every few days, whenever an image is deemed of sufficient quality for public consumption. Note that the downloader UI doesn’t yet show progress, so if it doesn’t do anything, it doesn’t mean it’s not working.

Those new images are read-only except for a few selected files and for the user profile and data; this is a base requirement for the delta updates to work properly.
However if the work you’re doing requires installation of extra non-click packages, such as developing on your device using the SDK, you have two options:

  1. Stick to the current flipped images which we’ll continue to generate for the foreseeable future.
  2. Use the experimental writable flag by doing touch /userdata/.writable_image and rebooting your device.
    This will make / writable again, however beware that applying image updates on such a system will lead to unknown results, so if you do choose to use this flag, you’ll have to manually update your device using apt-get (and possibly have to unmount/remount some of the bind-mounted files depending on which package needs to be updated).

From now on, the QA testing effort will focus on those new images rather than the standard flipped ones. I’d also highly recommend that all our application developers at least test their apps with those images and report any bug that they see in #ubuntu-touch (irc.freenode.net).

 

Daniel Holbach

These are very exciting times for Ubuntu Touch. Not only is the Ubuntu Edge, an Ubuntu super-phone, being funded right now, but we are also making lots of progress on getting Ubuntu running perfectly on phones and tablets near you.

Ubuntu Touch

I blogged about this a couple of times now, but Ubuntu Touch has been ported to LOTS of devices in the meantime. If we consult our Touch Devices list, there are 45 working ports, with 30 more in progress, and across 21 different brands. This is awesome. Now it’s time to bring all of them into the fold.

There are two things we have to do:

  1. Update some of the ports to the flipped container model. This switch has been happening over the last couple of weeks, but we’re there now. Android bits now run on top of an Ubuntu container. Some of the images still need to be updated to benefit from this.
  2. Enable the ports in phablet-flash. Yes, you read correctly. Since the announcement of the Touch preview, we only supported four devices (Galaxy Nexus, Nexus 4, Nexus 7 and Nexus 10). We always wanted to make it easier to flash all other devices too, and now we’re almost there: If you as an image maintainer make some information available, phablet-flash will soon be able to pick it up.

Updating your image to the new world order is something we are discussing today, 1st August, in #ubuntu-touch on irc.freenode.net. We are having an Ubuntu Touch Porting Clinic today. So bring your device, your questions and we’ll help you get set up for the new image formats.

If you want your images to be supported by phablet-flash, that can be easily arranged too. Follow this process to document how the flashing of your image works. Check out the latest branch of phablet-flash (not yet landed in trunk) to check whether your image works: lp:~sergiusens/phablet-tools/flash_change.

As always: if you have any questions, talk to us on #ubuntu-touch on irc.freenode.net or on the ubuntu-phone mailing list.

Update: now it’s just
bzr branch lp:phablet-tools; cd phablet-tools
./phablet-flash community --device <vendor>

Stéphane Graber

Some of you may be aware that I, along with Barry Warsaw and Ondrej Kubik, have been working on image-based upgrades for Ubuntu Touch.
This is going to be the official method to update any Ubuntu Touch device. When using this system, the system will effectively be read-only with updates being downloaded over the air from a central server and applied in a consistent way across all devices.
Design details may be found at: http://wiki.ubuntu.com/ImageBasedUpgrades

After several months of careful design and implementation, we are now ready to get more testers. We are producing daily images for our 4 usual devices, Galaxy Nexus (maguro), Nexus 4 (mako), Nexus 7 (grouper) and Nexus 10 (manta).
At this point, only those devices are supported. We’ll soon be working with the various ports to see how to get them running on the new system.

So what’s working at this point?

  • Daily delta images are generated and published to
    http://system-image.ubuntu.com
  • We have a command line client tool (system-image-cli), an update server and an upgrader sitting in the recovery partition
  • The images usually boot and work

What doesn’t work?

  • Installing apps as the system partition is read-only and we’re waiting for click packages to be fully implemented in our images
  • Data migration. We haven’t implemented any migration script from the current images to the new ones, so switching will wipe everything from your device
  • Possibly quite some more features I haven’t tested yet

So how can I help?

You can help us if:

  • You have one of the 4 supported devices
  • You don’t use that device for your everyday work
  • You don’t need to install any extra apps
  • You don’t care about losing all your existing data
  • You’re usually able to use adb/fastboot to recover from any problems that might happen

If you don’t fit all of the above criteria, please stick to the current flipped images.
If you think you’re able to help us and want to test those new images, then here’s how to switch to them:

  1. Get the latest version of phablet-tools (>= 0.15+13.10.20130720.1-0ubuntu1)
  2. Boot your device
  3. Backup anything you may want to keep as it’ll be wiped clean!!!
  4. Run: phablet-flash --ubuntu-bootstrap
  5. Wait for it to finish downloading and installing
  6. You’re done!
  7. To apply any further update, use: adb shell system-image-cli
    (never use phablet-flash after the initial flash, updates can only be applied through system-image-cli!)

Reverting to standard flipped images:

  • Boot your device
  • Backup anything you may want to keep as it’ll be wiped clean!!!
  • Run: phablet-flash --bootstrap
  • Wait for it to finish downloading and installing
  • You’re back to standard flipped images!

To report bugs, the easiest is to go to:
https://launchpad.net/ubuntu-image-image/+filebug

We also all hang out in #ubuntu-touch on irc.freenode.net

 
