Wednesday, 28 December 2011

Monitoring /proc/timer_stats

The /proc/timer_stats interface allows one to check on timer usage in a Linux system and hence detect any misuse of timers that can cause excessive wake up events (and also waste power).  /proc/timer_stats reports the process id (pid) of a task that initialised the timer, the name of the task, the name of the function that initialised the timer and the name of the timer callback function.  To enable timer sampling, write "1\n" to /proc/timer_stats and to disable write "0\n".

While this interface is simple to use, collecting multiple samples over a long period of time to monitor overall system behaviour takes a little more effort.   To help with this, I've written a very simple tool called eventstat that calculates the rate of events per second and can dump the data in a .csv (comma separated values) format for importing into a spreadsheet such as LibreOffice for further analysis (such as graphing).

In its basic form, eventstat will run ad infinitum and can be halted by control-C. One can also specify the sample period and number of samples to gather, for example:

 sudo eventstat 10 60  

.. this gathers samples every 10 seconds for 60 samples (which equates to 10 minutes).

The -t option specifies an events/second threshold to discard events less than this threshold, for example:  sudo cpustat -t 10 will show events running at 10Hz or higher.

To dump the samples into a .csv file, use the -r option followed by the name of the .csv file.  If you just want to collect just the samples into a .csv file and not see the statistics during the run, use also the -q option, e.g.

 sudo eventstat -q -r event-report.csv

With eventstat you can quickly identify rouge processes that cause a high frequency of wake ups.   Arguably one can do this with tools such as PowerTop, but eventstat was written to allow one to collect the event statistics over a very long period of time and then help to analyse or graph the data in tools such as Libre Office spreadsheet.

The source is available in the following git repository:  git:// and in my power management tools PPA:

In an ideal world, application developers should check their code with tools like eventstat or PowerTop to ensure that the application is not misbehaving and causing excessive wake ups especially because abuse of timers could be happening in the supporting libraries that applications may be using.

Tuesday, 13 December 2011

Improving Battery Life in Ubuntu Precise 12.04 LTS

Part of my focus this cycle is to see where we can make power saving improvements for Ubuntu Precise 12.04 LTS. There has been a lot of anecdotal evidence of specific machines or power saving features behaving poorly over the past few cycles.   So, armed with a 6.5 digit precision multimeter from Fluke I've been measuring the power consumption on various laptops in different test scenarios to try and answer some outstanding questions:

* Is it safe to enable Matthew Garrett's PCIe ASPM fix?
* Are the power savings suggested by PowerTop useful and can we reliably enabled any of these in pm-utils?
* How accurate are the ACPI battery readings to estimate power consumption?
* Do the existing pm-utils power.d scripts still make sense?
* Which is better for power saving: i386, i386-pae or amd64?
* How much power does the laptop backlight really use?
* Does halving the mouse input rate really save that much more power?
* Should we re-enable Aggressive Link Power Management (ALPM)?
* Are there any misbehaving applications that are consuming too much power?
* What are the root causes of HDD wake-ups
* Which applications and daemons are creating unnecessary wake events?
* How much does the MSR_IA32_ENERGY_PERF_BIAS save us?

..and many more besides!

From some of the analysis and "crowd sourcing" tests it is clear that the PCIe ASPM fix works well, so we've already incorporated that into Precise.

Aggressive Link Power Management (ALPM) is a mechanism where a SATA AHCI controller can put the SATA link that connects to the disk into a very low power mode during periods of zero I/O activity and into an active power state when work needs to be done. Tests show that this can save around 0.5-1.5 Watts of power on a typical system. However, it has been known in the past to not work on some devices, so I've put a call for testing of ALPM out to the community so we can get a better understanding of the power savings vs reliability.

Some of the PowerTop analysis has shown we can save another 1-2 Watts of power by putting USB and PCI controllers of devices like Webcams, SD card controllers, Wireless, Ethernet and Bluetooth  into a lower power state.  Again, we would like to understand the range of power savings across a large set of hardware and to see how reliable this is, so another crowd sourced call for testing has been also set up.

So, if you want to contribute to the testing, please visit the above links and spend just a few tens of minutes to see we can extend the battery life of your laptop or netbook.  And periodically visit to see if there any new tests you can participate in.


I've written some brief notes on power saving tweaks and also some simple recommendations for application developers to follow too.

The thread continues here (part 2)

Wednesday, 16 November 2011

Google's _nomap SSID madness

Now, I try to write positive comments on my blog, but now and again things really irk me and I need to comment about them.  

Google is using wireless access point SSIDs to construct a database to enable devices to determine their location using wireless and hence not relying on GPS.   If you want to opt out of this database, Google is suggesting that one should simply append _nomap to the SSID.   Google also hopes that this will become a standard SSID opt-out for any location service database.

This basically means that if you want your desired SSID you get opted into Google's database (so much for privacy), otherwise you have to put up with some utterly stupid name that Google mandates.  Thanks for the choice Google.  And there is nothing to stop other location service providers either suggesting a different naming scheme to make it impossible to opt out of one or more schemes.

Now, if the UK government mandated that all SSIDs needed to be named in a specific way to opt out of their special database, there would be uproar.  However, Google just ploughs ahead with more of their data gathering and nobody seems to complain.

Monday, 7 November 2011

Does your UEFI firmware have CSM support or not?

UEFI  Compatibility Support Module (CSM) provides compatibility support for traditional legacy BIOS.  This allows allows the booting an operating system that requires a traditional option ROM support, such as BIOS Int 10h video calls.

While looking at boot and runtime misbehaviour on UEFI systems I would like to know if CSM is enabled or not, but the question is how does one detect CSM support?   Well, making the assumption that CSM is generally enabled to support Int 10h video calls, we look for any video option ROMs and see if the real mode Int 10h vector is set to jump to a handler in one of the ROMs.

Option ROMs are found in the region 0xc0000 to 0xe0000 and normally the video option ROM is found at 0xc0000.  Option ROMs are found on 512 byte boundaries with a header bytes containing 0x55, 0xaa and ROM length (divided by 512) so we just mmap in 0xc0000..xe0000 and then scan the memory for headers to locate option ROM images.

My assumption for CSM being enabled is that Int 10h vectors into one of these option ROMs, and we can assume it is a video option ROM if it contains the string "VGA" somewhere in the ROM image.  Yes, it is a hack, but it seems to work on the range of UEFI enabled systems I've so far used.

For reference, I've put the code in my debug-code git repository and available for anyone to use.

I've also added a CSM test to the Firmware Test Suite (fwts).  One can install fwts and run the test on Ubuntu by using:

 sudo apt-get install fwts  
 sudo fwts csm -  

Friday, 28 October 2011

UEFI Secure Boot and Linux

There has been a lot of (heated) discussion in the past weeks concerning UEFI Secure Boot and how this can impact on the ability of a user to install their operating system of choice.

To address this, today has seen no less than two papers published to address this hot topic.  Canonical along with Red Hat have published a white paper that describes how UEFI Secure Boot will impact users and manufactures.   The paper also provides recommendations on the implementation of UEFI Secure Boot in way that allows users to be in control of their own PC hardware.

Meanwhile the Linux Foundation has also published a paper giving technical guidance on how to implement UEFI Secure Boot to allow operating systems other than Windows 8 to operate on new Windows 8 PCs.

So lots to read and good technical guidance all round.  Let's  hope that these constructive set of papers will push the argument into a positive outcome.

Sunday, 23 October 2011

Determining the number of arguments in a C vararg macro

C vararg macros are very useful and I've generally used them a lot for wrapping C vararg functions.  However, at times it would be very useful to be able to determine the number of arguments being passed into the the vararg macro and this is not as straight forward as it first seems.

Anyhow, this problem has been asked many times on the usenet and internet, and I stumbled on a very creative solution by Laurent Deniau posted on comp.std.c back in 2006.

 #define PP_NARG(...) \  
 #define PP_NARG_(...) \  
 #define PP_ARG_N( \  
      _1, _2, _3, _4, _5, _6, _7, _8, _9,_10, \  
      _11,_12,_13,_14,_15,_16,_17,_18,_19,_20, \  
      _21,_22,_23,_24,_25,_26,_27,_28,_29,_30, \  
      _31,_32,_33,_34,_35,_36,_37,_38,_39,_40, \  
      _41,_42,_43,_44,_45,_46,_47,_48,_49,_50, \  
      _51,_52,_53,_54,_55,_56,_57,_58,_59,_60, \  
      _61,_62,_63,N,...) N  
 #define PP_RSEQ_N() \  
      63,62,61,60,          \  
      59,58,57,56,55,54,53,52,51,50, \  
      49,48,47,46,45,44,43,42,41,40, \  
      39,38,37,36,35,34,33,32,31,30, \  
      29,28,27,26,25,24,23,22,21,20, \  
      19,18,17,16,15,14,13,12,11,10, \  
 /* Some test cases */  
 PP_NARG(A) -> 1  
 PP_NARG(A,B) -> 2  
 PP_NARG(A,B,C) -> 3  
 PP_NARG(A,B,C,D) -> 4  
 PP_NARG(A,B,C,D,E) -> 5   

However, passing no arguments to this macro yields 1, which is not as we expect. So last night I tweaked the macro to fix this problem by checking the length of the stringified macro arguments and adjusting the return value for a empty __VA_ARGS__  - as follows:

 #define PP_NARG(...)  (PP_NARG_(__VA_ARGS__,PP_RSEQ_N()) - \  
     (sizeof(#__VA_ARGS__) == 1))  
 #define PP_NARG_(...)  PP_ARG_N(__VA_ARGS__)  
 #define PP_ARG_N( \  
    _1, _2, _3, _4, _5, _6, _7, _8, _9,_10, \  
   _11,_12,_13,_14,_15,_16,_17,_18,_19,_20, \  
   _21,_22,_23,_24,_25,_26,_27,_28,_29,_30, \  
   _31,_32,_33,_34,_35,_36,_37,_38,_39,_40, \  
   _41,_42,_43,_44,_45,_46,_47,_48,_49,_50, \  
   _51,_52,_53,_54,_55,_56,_57,_58,_59,_60, \  
   _61,_62,_63, N, ...) N  
 #define PP_RSEQ_N() \  
     63,62,61,60,          \  
     59,58,57,56,55,54,53,52,51,50, \  
     49,48,47,46,45,44,43,42,41,40, \  
     39,38,37,36,35,34,33,32,31,30, \  
     29,28,27,26,25,24,23,22,21,20, \  
     19,18,17,16,15,14,13,12,11,10, \  

The purists may point out that PP_NARG() only handles 64 arguments.  For just integer arguments, a better solution for any number of arguments has been proposed by user qrdl on stackoverflow:

 #define NUMARGS(...) (int)(sizeof((int[]){0, ##__VA_ARGS__})/sizeof(int)-1)

..which is appealing as it is more immediately understandable than the PP_NARG() macro, however it is less generic since it only works for ints.

Anyhow, it's great to find such novel solutions even if they may be at first a little bit non-intuitive.

Forcing a CMOS reset from userspace

Resetting CMOS memory on x86 platforms is normally achieved by either removing the CMOS battery or by setting a CMOS clear motherboard jumper in the appropriate position.  However, both these methods require access to the motherboard which is time consuming especially when dealing with a laptop or netbook.

An alternative method is to twiddle specific bits in the CMOS memory so that the checksum is no longer valid and on the next boot the BIOS detects this and this generally forces a complete CMOS reset.

I've read several ways to do this, however the CMOS memory layout varies from machine to machine so some suggested solutions may be unreliable across all platforms.  Apart from the Real Time Clock (which writing to won't affect a CMOS reset), the only CMOS addresses to be consistently used across most machines are 0x10 (Floppy Drive Type), 0x2e (CMOS checksum high byte) and 0x2f (CMOS checksum low byte).  With this in mind, it seems that the best way to force a CMOS reset is to corrupt the checksum bytes, so my suggested solution is to totally invert each bit of the checksum bytes.

To be able to read the contents of CMOS memory we need to write the address of the memory to port 0x70 then delay a small amount of time and then read the contents by reading port 0x71.    To write to CMOS memory we again write the address to port 0x70, delay a little, and then write the value to port 0x71.   A small delay of 1 microsecond (independent of CPU speed)  can be achieved by writing to port 0x80 (the Power-On-Self-Test (POST) code debug port).

 static inline uint8_t cmos_read(uint8_t addr)  
     outb(addr, 0x70);    /* specify address to read */  
     outb(0, 0x80);       /* tiny delay */  
     return inb(0x71);    /* read value */  
 static inline void cmos_write(uint8_t addr, uint8_t val)  
     outb(addr, 0x70);    /* specify address to write */  
     outb(0, 0x80);       /* tiny delay */  
     outb(val, 0x71);     /* write value */  

And hence inverting CMOS memory at a specified address is thus:

 static inline void cmos_invert(uint8_t addr)  
     cmos_write(addr, 255 ^ cmos_read(addr));  

To ensure we are the only process accessing the CMOS memory we should also turn off interrupts, so we use iopl(3) and asm("cli") to do this and then asm("sti") and iopl(0) to undo this.   We also need to use ioperm() to get access to ports 0x70, 0x71 and 0x80 for cmos_read() and cmos_write() to work and we need to run the program with root privileges.  The final program is as follows:

 #include <stdio.h>  
 #include <stdlib.h>  
 #include <stdint.h>  
 #include <unistd.h>  
 #include <sys/io.h>  
 #define CMOS_CHECKSUM_HI (0x2e)  
 #define CMOS_CHECKSUM_LO (0x2f)  
 static inline uint8_t cmos_read(uint8_t addr)  
     outb(addr, 0x70);    /* specify address to read */  
     outb(0, 0x80);       /* tiny delay */  
     return inb(0x71);    /* read value */  
 static inline void cmos_write(uint8_t addr, uint8_t val)  
     outb(addr, 0x70);    /* specify address to write */  
     outb(0, 0x80);       /* tiny delay */  
     outb(val, 0x71);     /* write value */  
 static inline void cmos_invert(uint8_t addr)  
     cmos_write(addr, 255 ^ cmos_read(addr));  
 int main(int argc, char **argv)  
     if (ioperm(0x70, 2, 1) < 0) {  
         fprintf(stderr, "ioperm failed on ports 0x70 and 0x71\n");  
     if (ioperm(0x80, 1, 1) < 0) {  
         fprintf(stderr, "ioperm failed on port 0x80\n");  
     if (iopl(3) < 0) {  
         fprintf(stderr, "iopl failed\n");  
     /* Invert CMOS checksum, high and low bytes*/  
     (void)ioperm(0x80, 1, 0);  
     (void)ioperm(0x70, 2, 0);  

You can find this source in by debug code git repo.

Before you run this program, make sure you know which key should be pressed to jump into the BIOS settings on reboot (such as F2, delete, backspace,ESC, etc.) as some machines may just display a warning message on reboot and need you to press this key to progress further.

So to reset, simple run the program with sudo and reboot.  Easy.  (Just don't complain to me if your machine isn't easily bootable after running this!)

Wednesday, 19 October 2011

PCI Interrupt Routing

Understanding PCI Interrupt Routing on the x86 platform is a not entirely straight forward.  The underlining principle is determining which interrupt is being asserted when a PCI interrupt signal occurs. Unfortunately this is generally platform specific and so firmware tables of various types have been used over the many years to describe the routing configuration.

While looking into the legacy PCI interrupt routing tables I found an excellent article by FreeBSD kernel hacker John Baldwin that explains PCI interrupt routing in a clear an succinct manner.   Although written for FreeBSD, this article is contains a lot of Linux relevant information.

Tuesday, 18 October 2011

UEFI EDK II Revisited

My colleague Manoj Iyer has written up a guide on how to download EDK II and build the UEFI firmware for QEMU.   This requires older versions of gcc found in Natty (since the newer Oneiric version is more pedantic and uses -Werror=unused-but-set-variable by default).

With a chroot I was able get it downloaded, built and tested in less than 40 minutes.   Here is a sample UEFI helloworld application running in QEMU using the firmware using Manoj's instructions.


Now I can rig up some tests to exercise Ubuntu and the Firmware Test Suite without the need for any real UEFI hardware..

Thursday, 13 October 2011

Dennis Ritchie, R.I.P.

Dennis Ritchie has passed away. He gave us C and UNIX and much more beside. My tribute to Dennis Ritchie is as follows:

 #include <stdio.h>  
 #include <stdlib.h>  
 #define K continue  
 #define t /*|+$-*/9  
 #define _l /*+$*/25  
 #define s/*&|+*/0xD  
 #define _/*&|+*/0xC  
 #define _o/*|+$-*/2  
 #define _1/*|+$-*/3  
 #define _0/*|+$*/16  
 #define J/*&|*/case  
 char typedef signed   
  B;typedef H;H main(  
  ){B I['F'],V=0,E[]=  
      {B L=*P,l=*(P+1),U=  
    (L)while(0){J _l:i=  
   t:i=U+l;K;J _:i=U;  
   K;J s:i=l;K;J _o:  
  putchar(U);K;J _1:  
  P+=U?0:l;K;J _0:e\  

You can download the source here.

Tuesday, 11 October 2011

Gource - software version control visualization

Today my colleague Chris Van Hoof pointed me to a Gource visualization of the work I've been doing on the Firmware Test Suite.  Gource animates the software development sources as a tree with the root in the centre of the display and directories as branches and source files as leaves.

Static pictures do this no justice. I've uploaded an mp4 video of the entire software development history of fwts so you can see Gource in action.

To generate the video, the following incantation was used:

 gource -s 0.03 --auto-skip-seconds 0.1 --file-idle-time 500 \  
 --multi-sampling -1280x720 --stop-at-end \  
 --output-ppm-stream - | ffmpeg -y -r 24 \  
 -f image2pipe -vcodec ppm -i - -b 2048K fwts.mp4  

..kudos to Chris for this rune.

Saturday, 1 October 2011

Proprietary Code - Where do we draw the line?

I can't help being amused when users say they chose not to use a specific Open Source distribution because it contains binary drivers and hence is not totally free.  I too really care that we have software freedom and try to work towards a totally free Operating System but where do we draw the line?

Some users state that they won't touch a specific brand of hardware such as Wireless or Video because one has to use a binary driver or that it contains firmware that is not Open Source.  While this is an admirable philosophical stance it has its blind-spots. For example,  laptops contain Embedded Controllers to do a variety of hardware interfacing tasks - do we refuse to use these laptops because the firmware in the Embedded Controllers are not Open Source?  Or how about the ACPI AML code that appears in the DSDT and SSDTs - so should we boot the machine with ACPI disabled because this code is not Open Source?

Taking it further, what about the Microcode inside the processor?  This binary blob is loaded by BIOS updated by the Operating System to fix subtle features in Microprocessors post-release.  So, should we stop using this because Intel won't supply us the source?

So at what point do we stop using a system because it is not fully Open Source?  OK, so I've taking the argument to its logical conclusion to stretch the point.   I fully understand that it is totally desirable to avoid using Closed Source binary blobs where possible and trying to keep a system totally Open Source keeps us honest.  However, sometimes I find the purest viewpoint rather blinkered if it refuses to use a specific distribution when their machine is riddled with Closed Source binary firmware blobs.  Perhaps they should start working on the BIOS vendors and Intel to release their code..

Friday, 30 September 2011

Why is my CPU Frequency Limited?

Sometimes the Scaling Maximum Frequency of a CPU is reduced below that of the possible top frequency and finding out why this is can be problematic.  The limitation could have been imposed by:

* Thermal limits
* Hardware limitations (e.g. ACPI _PPC object).
* Program that wrote to a /sys/devices/cpu/cpu*/cpufreq/scaling_max_freq

Fortunately Thomas Renninger introduced /sys/devices/cpu*/cpufreq/bios_limit that exports to user space the BIOS limited maximum frequency for each CPU.  This feature is available in Ubuntu Maverick 10.10 upwards.

So, if you have a machine that you believe should have CPUs running at a higher frequency, inspect the bios_limit files to see if the BIOS is mis-configured.

Tuesday, 27 September 2011

More interesting uses for SystemTap

Now and again while debugging systems I would like to be able to evaluate an ACPI method or object and see what it returns.    To solve this problem in a generic way I put a little bit of effort today in developing a short SystemTap script that allows me to run acpi_evaluate_object() on a given named object and to dump out any returned data.

I've wrapped the gory details up into a small wrapper function called which is passed just the name of the object to evaluate and if necessary an acpi_object_list containing the arguments to be passed into a method call.   For methods that don't require arguments, we end up with a simple call like the  following:

                acpi_eval("_BIF", NULL);

..and for more complex examples, with arguments, we have:

        struct acpi_object_list arg_list;
        union acpi_object args[1];

        args[0].type = ACPI_TYPE_INTEGER;
        args[0].integer.value = 1; 
        arg_list.count = 1;
        arg_list.pointer = args; 
        acpi_eval("_WAK", &arg_list);

The script is designed to allow one to hack away and add in the calls to the objects that one requires to be evaluated, quick-n-dirty, but it does the job.

It was surprisingly easy to get this up and running with SystemTap once I had figured how to dump out the evalated objects to the tty - the script is fairly compact and the main bulk of the code is a terse error message table.

The script can be found it my SystemTap git repo: git://

SystemTap print statements from "embedded C" functions.

SystemTap provides a flexible programming language to prototype debugging scripts very quickly.  Sometimes however, one has to use "embedded C" functions in a SystemTap script to interface more deeply with the kernel. 

Today I was writing a script to dump out ACPI object names and required some embedded C in my SystemTap script to walk the ACPI namespace and this required a C callback function.   However, inside the C callback I wanted to print the handle and name of the ACPI object but couldn't figure out how to use the native SystemTap print() functions from within embedded C code.    So I crufted up a simple "HelloWorld" SystemTap script and ran it with -k to keep the temporary sources and then had a look at the automagically generated code.

It appears that SystemTap converts the script print statements into _stp_printf()  C calls, so I just plugged these into my C callback instead of using printk().  Now my output goes via the underlying SystemTap print mechanism and appears on the tty rather than going to the kernel log.  Bit of a hack, but the result is easy to use.  I wish it was documented though.

Here is a sample of the original script to illustrate the point:

 #include <acpi/acpi.h>  
 static acpi_status dump_name(acpi_handle handle, u32 lvl, void *context, void **rv)  
     struct acpi_buffer buffer = {ACPI_ALLOCATE_BUFFER};  
     int *count = (int*)context;  
     if (!ACPI_FAILURE(acpi_get_name(handle, ACPI_FULL_PATHNAME, &buffer))) {  
         _stp_printf(" %lx %s\n", handle, (char*)buffer.pointer);  
     return AE_OK;  

Monday, 26 September 2011

Mac Mini rebooting tweaks: setpci -s 0:1f.0 0xa4.b=0

Last night I was asked why Mac Minis require "setpci -s 0:1f.0 0xa4.b=0" to force the Mac to auto-reboot in the event of a power failure.  Well, after a lot of Googling around I found that this setpci rune is quoted in a lot of places and at a guess probably originated from advice on the Mythical Beasts website. However, the explanation of what this rune actually did was distinctly lacking.

So, why is it required?

After some more searching around I found that device 00:1f.0 on the Mac Mini refers to:

00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface Bridge (rev 02) my next step was to figure out why writing a zero byte to register 0xa4 on this device allows the Mac Mini to reboot. I located and download the ICH7 PDF from Intel and register at offset 0xa4 can be found in section  This refers to GEN_PMCON_3—General PM Configuration 3 Register.   Even though the setpci command is clearing this whole register, I suspect we are just interested in clearing bit zero. The PDF states:

"AFTERG3_EN — R/W. This bit determines what state to go to when power is re-applied after a power failure (G3 state). This bit is in the RTC well and is not cleared by any type of reset except writes to CF9h or RTCRST#.

0 = System will return to S0 state (boot) after power is re-applied.
1 = System will return to the S5 state (except if it was in S4, in which case it will return to S4). In the S5 state, the only enabled wake event is the Power Button or any enabled wake event that was preserved through the power failure."

So, it looks like the "setpci -s 0:1f.0 0xa4.b=0" magic is just to return to a S0 (boot state) after power is re-applied after a power failure.   All is explained, so not so magical after all.

Sunday, 25 September 2011

Exponential Growth of Patents

The United States Patent and Trademark Office (USPTO) recently publish an interesting article about the millions of patents issued by the United States of America.  Using the current numbering system, patent #1 was issued in 1836 and patent #8,000,000 was recently issued in August this year.   I plotted issue date against patent number and lo and behold we get exponential growth of patents since the turn of the 20th century:

At this rate, we will see 8 million more patents issued by the end of 2016.  Intellectual property is abundant, perhaps too much so.  Are all these patents totally valid?  Is there any kind of quality control being applied?  Personally, I doubt it.  I don't want to be alarmist, but I really think this is getting totally out of control.

Patents are used as trading tokens as big businesses wage war against each other.  Companies are loading their war-chests with patent portfolios to block rivals from bringing to market innovative new products which leads to a product monoculture.   More perversely, patents are being used so sue users of technology rather than manufacturers.  Ultimately the consumer is the loser and patent lawyers and big business are the winners -  that's the price for patents protecting innovation.

Thursday, 22 September 2011

Tweaking partitions for optimal use of the HDD

By default Ubuntu is installed with the root filesystem at the start of the disk drive and with swap right at the end.    If one analyses the read/write performance of a hard disk drive (HDD) one will quickly spot that the I/O rates differ depending on the physical location of the data.

From the relatively small sample of laptop and desktop drives that I've looked at it seems that reads from the logical start of the drive are fastest and drop off down to roughly half that rate near the end of the drive.    The rate is higher for data on the outer tracks (because there are more data sectors) and lower toward the inner tracks (fewer data sectors).

Since my new 7200rpm 250GB drive performs fastest at the lowest logical block locations,  it makes sense to construct my partitions to utilise this.  For my configuration, I want to load my kernels and initrd in quickly and be able to swap and hibernate fairly quickly too.  Next I want applications to load quickly, and my user data (such as mp3s, cached Email, etc) I care less about for performance.   So, with these constraints, I created separate partitions in this order:

1st /boot (ext4), 2nd swap, 3rd / (ext4) and 4th /home (ext4).

Some quick'n'dirty write benchmarks show me that:

/boot : 84.74 MB/s
swap  : 84.44 MB/s
/     : 82.26 MB/s
/home : 73.30 MB/s this should make booting, swapping and hibernating just slightly faster.   Over the lifetime of the drive the random file writes and deletions in /home won't cause /boot new kernels and initrd images to be fragmented because the are on separate partitions.   Also I can avoid over-writing all my user data in /home if I do a clean installation of Ubuntu into /boot and / at a later date.

Sunday, 18 September 2011

Laptop HDD woes

I do quite a bit of international travelling and my old klunky Lenovo 3000N200 takes a few knocks and consequently I've had to purchase my 2nd HDD for this laptop in the past 3.5 years.

Last week my laptop hung for tens of seconds while logging in - and once more again today.  Looking at the kernel log I was able to see repeated time-outs on read errors which was a little alarming.   The palimpsest utility showed that I had a few bad sectors and there were a few pending to be remapped.   I had a quick look at the S.M.A.R.T. data using:

sudo smartctl -d ata -a /dev/sda

..and saw that I'd got 5311 hours of use out of the drive and considering I bought it about 400 days ago works out to be ~13.25 hours of usage per day on average.  Peeking at  /sys/fs/ext4/sda*/lifetime_write_kbytes it appeared I had written 1.4TB of data, which works out to be 0.27GB of writes per hour of use on average - which sounds fair as my laptop is mainly used for Web, Email and the occasional bit of compilation (as I do most kernel builds on large servers).

So what do I replace it with?  Well, being a cheapskate, I did not want to splash out on an expensive SSD on this relatively old laptop (which I will palm off to my kids fairly soon), so I went for an spinny disk upgrade.  My original drive was a 160GB 5400rpm WD1600BEVT - this time I spent an extra £5 and got a 2500GB 7200rpm WD2500BEKT with double the internal cache and improved read performance - the postage was free from so double win.

Saturday, 17 September 2011

Formatting Source Code in Blogs

At last, I've found a useful tool for producing correctly formatted source code for the inclusion into my blog.

Thanks to codeformatter, one can paste in source, select the appropriate formatting style options and produce blog formatted output to paste into one's blog articles! Easy!

Reading MTRRs via the MTRRIOC_GET_ENTRY ioctl()

The MTRRIOC_GET_ENTRY ioctl() is a useful but under-used ioctl() for reading the MTRR configuration from /proc/mtrr.  Instead of having to read and parse /proc/mtrr, the ioctl() provides a simple interface to easily fetch each MTRR.

A struct mtrr_gentry is passed to the ioctl() with the regnum member set to the MTRR register one wants to read. After a successful ioctl() call, size member of struct mtrr_gentry is less than 1 if the MTRR is disabled, otherwise it is populated with the MTRR register configuration.

Below is an example showing how to use MTRRIOC_GET_ENTRY:

 #include <stdio.h>  
 #include <stdlib.h>  
 #include <string.h>  
 #include <sys/types.h>  
 #include <sys/stat.h>  
 #include <sys/ioctl.h>  
 #include <asm/mtrr.h>  
 #include <fcntl.h>  
 #define LONGSZ    ((int)(sizeof(long)<<1))  
 int main(int argc, char *argv[])  
     struct mtrr_gentry gentry;  
     int fd;  
     static char *mtrr_type[] = {  
         "Write Combining",  
         "Write Through",  
         "Write Protect",  
         "Write Back"  
     if ((fd = open("/proc/mtrr", O_RDONLY, 0)) < 0) {  
         fprintf(stderr, "Cannot open /proc/mtrr!\n");  
     memset(&gentry, 0, sizeof(gentry));  
     while (!ioctl(fd, MTRRIOC_GET_ENTRY, &gentry)) {  
         if (gentry.size < 1)   
             printf("%u: Disabled\n", gentry.regnum);  
             printf("%u: 0x%*.*lx..0x%*.*lx %s\n", gentry.regnum,  
                 LONGSZ, LONGSZ, gentry.base,   
                 LONGSZ, LONGSZ, gentry.base + gentry.size,  

The downside to using MTRRIOC_GET_ENTRY is that MTRR base addresses > 4GB get returned as zero, which is a known "feature" of this interface - the offending code in mtrr_ioctl() in arch/x86/kernel/cpu/mtrr/if.c is as follows:

 /* Hide entries that go above 4GB */  
 if (gentry.base + size - 1 >= (1UL << (8 * sizeof(gentry.size) - PAGE_SHIFT))  
   || size >= (1UL << (8 * sizeof(gentry.size) - PAGE_SHIFT)))  
     gentry.base = gentry.size = gentry.type = 0;  
 else {   
     gentry.base <<= PAGE_SHIFT;  
     gentry.size = size << PAGE_SHIFT;  
     gentry.type = type;   

..clearly showing that gentry.base, .size and .type are set to zero for entries > 4GB.

Thursday, 8 September 2011

Firmware Test Suite presentation at Linux Plumbers Conference

This week I'm attending the Linux Plumbers Conference in Santa Rosa, CA.  Yesterday I gave a brief presentation of the Firmware Test Suite in the Development Tools, and for reference, I've uploaded the slides here.

Monday, 29 August 2011

Finding unwanted global variables and functions with dwarves

The dwarves package contains a set of useful tools that use the DWARF information placed in the ELF binaries by the compiler. Utilities in the dwarves package include:

pahole: This will find alignment holes in structs and classes in languages such as C/C++.  With careful repacking one can achieve better cache hits.  I could have done with this when optimising some code a few years back...

codiff:  This is a diff like tool use to compare the effect a change in source code can create on the compiled code.

pfunct:  This displays information about functions, inlines, goto labels, function size, number of variables and much more.

pdwtags: A DWARF information pretty-printer.

pglobal:  Dump out global symbols.

prefcnt:  A DWARF tags usage count.

dtagnames: Will lists tag names.

So, using pglobal, I was able to quickly check which variables I had made global (or accidentally not made them static!) on some code that I was developing as follows:

pglobal -v progname

and the same for functions:

pglobal -f progname

Easy!  Obviously these tools only work if the DWARF information is not stripped out.

All in all, these are really useful tools to know about and will help me in producing better code in the future.

Wednesday, 24 August 2011

Firmware Test Suite full documentation.

I've now competed the documentation of the Firmware Test Suite and this include documenting each of the 50+ tests (which was a bit of a typing marathon).  Each test has a brief description of what the test covers, example output from the test, how to run the test (possibly with different options) and explanations of test failure messages.

For example of the per-test documentation, check out the the suspend/resume test page and the ACPI tables test page.

I hope this is useful!

Monday, 22 August 2011

Assume Nothing.

When I was a very junior software engineer working on Fortran 77 signal processing modules on MicroVaxes, PDP-11s, Masscomps and old 286 PCs I was given some very wise words by the owner of the company:   "Assume nothing".   This has stuck with me for nearly quarter of a century.  It is pithy, easy to remember and is so true for software engineering.

1. "Assume nothing" makes me look up details when I'm not 100% sure.
2. "Assume nothing" means that I double check my facts when I think I'm 100% sure.
3. "Assume nothing" makes me question even the so called 'obvious'.  "Of course it will work.." turns into "are we sure it will work for every possible case?"
4. "Assume nothing" makes me dot the i's and cross the t's.
5. "Assume nothing" keeps me sceptical, which is useful as there is a lot of stupidity masquerading as knowledge on the internet.

I could ramble on. However enough said. Just assume nothing, it will keep you out of a lot of trouble.

How x86 computers boot up

Gustavo Duarte has written a concise and very readable article describing how computers boot up.  Well worth reading.

Friday, 19 August 2011

Which version of the GCC compiled my program?

Here's a quick one-liner to find out which version of GCC was used to compile some code:

readelf -p .comment a.out

String dump of section '.comment':
  [     0]  GCC: (Ubuntu/Linaro 4.6.1-7ubuntu1) 4.6.1
  [    2a]  GCC: (Ubuntu/Linaro 4.6.1-6ubuntu6) 4.6.1

Thursday, 18 August 2011

Debugging S3 suspend/resume using SystemTap and minimodem.

Some problems are a little challenging to debug and require sometimes a bit of lateral thinking to solve.   One particular issue is when suspend/resume locks up and one has no idea where or why because the console has is suspended and any debug messages just don't appear.

In the past I've had to use techniques like flashing keyboard LEDs, making the PC speaker beep or even forcible rebooting the machine at known points to be able to get some idea of roughly where a hang has occurred.   This is fine, but it is tedious since we can only emit a few bits of state per iteration.   Saving state is difficult since when a machine locks up one has to reboot it and one looses debug state.   One technique is to squirrel away debug state in the real time clock (RTC) which allows one to store twenty or so bits of state, which is still quite tough going.

One project I've been working on is to use the power of system tap to instrument the entire suspend/resume code paths - every time a function is entered a hash of the name is generated and stored in the RTC.  If the machine hangs, one can then grab this hash out of the RTC can compare this to the known function names in /proc/kallsyms, and hopefully this will give some idea of where we got to before the machine hung.

However, what would be really useful is the ability to print out more debug state during suspend/resume in real time.   Normally I approach this by using a USB/serial cable and capturing console messages via this mechanism.  However, once USB is suspended, this provides no more information.

One solution I'm now using is with Kamal Mostafa's minimodem.  This wonderful tool is an implementation of a software modem and can send and receive data by emulating a Bell-type or RTTY FSK modem.  It allows me to transmit characters at 110 to 300 baud over a standard PC speaker and reliably receive them on a host machine.  If the wind is in the right direction, one can transmit at higher speeds with an audio cable plugged in the headphone jack of the transmitter and into the microphone socket on the receiver if hardware allows.

The 8254 Programmable Interval-timer on a PC can be used to generate a square wave at a predefined frequency and can be connected to the PC speaker to emit a beep.  Sending data using the speaker to minimodem is a case of sending a 500ms leader tone, then emitting characters.  Each character has a 1 baud space tone, followed by 8 bits (least significant bit first) with a zero being a 1 baud space tone and a 1 being represented by a 1 baud mark tone, and the a trailing bunch of stop bits.

So using a prototype driver written by Kamal, I tweaked the code and put it into my suspend/resume SystemTap script and now I can dump out messages over the PC speaker and decode them using minimodem.  300 baud may not be speedy, but I am able to now instrument and trace through the entire suspend/resume path.

The SystemTap scripts are "work-in-progress" (i.e. if it breaks you keep the pieces), but can be found in my pmdebug git repo git://  The README file gives a quick run down of how to use this script and I have written up a full set of instructions.

The caveat to this is that one requires a PC where one can beep the PC speaker using the PIT.  Lots of modern machines seem to either have this disabled, or the volume somehow under the control of the Intel HDA audio driver.  Anyhow, kudos to Kamal for providing minimodem and giving me the prototype kernel driver to allow me to plug this into a SystemTap scrip.

Wednesday, 17 August 2011

Fragmentation on ext2/ext3/ext4 filesystems

If you want know how fragmented a file is on an ext2/ext3/ext4 filesystem there are a couple of methods of finding out.  One method is to use hdparm, but one needs CAP_SYS_RAWIO capability to do so, hence run with sudo:

sudo hdparm --fibmap /boot/initrd.img-2.6.38-9-generic

 filesystem blocksize 4096, begins at LBA 24000512; assuming 512 byte sectors.
 byte_offset  begin_LBA    end_LBA    sectors
           0   35747840   35764223      16384
     8388608   39942144   39958527      16384
    16777216   40954664   40957423       2760

Alternatively, one can use the filefrag utility (part of the e2fsprogs package) for reporting the number of extents:

filefrag /boot/initrd.img-2.6.38-9-generic
/boot/initrd.img-2.6.38-9-generic: 3 extents found

..or more verbosely:

filefrag -v /boot/initrd.img-2.6.38-9-generic
Filesystem type is: ef53
File size of /boot/initrd.img-2.6.38-9-generic is 18189027 (4441 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0  1468416            2048
   1    2048  1992704  1470463   2048
   2    4096  2119269  1994751    345 eof
/boot/initrd.img-2.6.38-9-generic: 3 extents found

Well, that's useful. So, going one step further, how many free extents are available on the filesystem? Well, e2freefrag is the tool for this:

sudo e2freefrag /dev/sda1
Device: /dev/sda1
Blocksize: 4096 bytes
Total blocks: 2999808
Free blocks: 1669815 (55.7%)

Min. free extent: 4 KB
Max. free extent: 1370924 KB
Avg. free extent: 6160 KB
Num. free extent: 1084

Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :           450           450    0.03%
    8K...   16K-  :            97           227    0.01%
   16K...   32K-  :            99           527    0.03%
   32K...   64K-  :           134          1475    0.09%
   64K...  128K-  :            96          2193    0.13%
  128K...  256K-  :            50          2235    0.13%
  256K...  512K-  :            30          2643    0.16%
  512K... 1024K-  :            36          6523    0.39%
    1M...    2M-  :            36         13125    0.79%
    2M...    4M-  :            14          9197    0.55%
    4M...    8M-  :            15         18841    1.13%
    8M...   16M-  :             8         17515    1.05%
   16M...   32M-  :             7         38276    2.29%
   32M...   64M-  :             3         39409    2.36%
   64M...  128M-  :             3         67865    4.06%
  512M... 1024M-  :             4        802922   48.08%
    1G...    2G-  :             2        646392   38.71%

..thus reminding me to do some housekeeping and remove some junk from my file system... :-)

Saturday, 13 August 2011

What's new in the Firmware Test Suite for Oneiric 11.10?

The Firmware Test Suite (fwts) is still a relatively new tool and hence this cycle I've still been adding some features and fixing bugs.  I've been running fwts against large data sets to soak test the tool to catch a lot of stupid corner cases (e.g. broken ACPI tables). Also, I am focused on getting some better documentation written (this is still "work in progress").

New tests for the Oneiric 11.10 release are as follows:

    Sanity check tables against the MultiProcessor Specification (MPS). For more information about MPS, see the wikipedia MPS page.

    Dump annotated MPS tables.

    Sanity check Model Specific Registers across all CPUs. Does some form of MSR default field sanity checking.

    Very simple suspend (S3) power checks.  This puts the machine into suspend and attempts to measure the power consumed while suspended. Generally this test gives more accurate results the longer one suspends the machine.  Your mileage may vary on this test.

     Hex dump of the Extended BIOS Data Area.

In addition to the above, the fwts "method" test is now expanded to evaluate and exercise over 90 ACPI objects and methods.

One can also join the fwts mailing list by going to the project page and subscribing.

Thursday, 28 July 2011

Making my 2nd Webcam the default for Empathy

It just so happens that I have two Webcams on my machine, one being a rather poor one built into the laptop and a 2nd better quality Logitech webcam.

Using the 2nd webcam by default in Empathy for video conference calls required a little bit of hackery with gconf-editor by changing /system/gstreamer/0.10/default/videosrc from v4l2src to vl4l2src device="/dev/video1"

This wasn't entirely the most user friendly way to configure the default. Ho hum..

Tuesday, 26 July 2011

The semantics of halt.

It appears that the semantics of halt mean it will stop the machine but it may or may not shut it down.  Back when I used UNIX boxes, halt basically stopped the machine but never powered it down; to power it down one had to explicitly use "halt -p".

So things change. With upstart, halt is a symbolic link to reboot and reboot calls shutdown -h.  The man page to shutdown states for the -h option:

"Requests that the system be either halted or powered off after it has been brought down, with the choice as to which left up to the system."

Hrm, so this vaguely explains why halting some machines may just halt and on others it may also shut the system down.  I've not digged into this thoroughly yet, but one suspects that for different processor architectures we get different implementations.  Even for x86 we have variations in CPUs and boards/platforms, so it really it is hard to say if halt will power down a machine.

The best bet is to assume halt just halts and if you want it to power down always use "halt -p".

Sunday, 24 July 2011

Using the PCI sysfs interface to dump the Video BIOS ROM

The Linux PCI core driver provides a useful (and probably overlooked) sysfs interface to read PCI ROM resources.  A PCI device that has a ROM resource will have a "rom" sysfs file associated with it, writing anything other than 0 to it will enable one to then read the ROM image from this file.

For example, on my laptop, to find PCI devices that have ROM images associated with them I used:

find /sys/devices -name "rom"

and this corresponds to my Integrated  Graphics Controller:

lspci | grep 02.0
00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (primary) (rev 0c)

To dump the ROM I used:

echo 1 | sudo tee /sys/devices/pci0000\:00/0000\:00\:02.0/rom
sudo cat /sys/devices/pci0000\:00/0000\:00\:02.0/rom > vbios.rom

To disassemble this I used ndisasm:

sudo apt-get install nasm
ndisasm -k 0,3 vbios.rom | less

..and just use strings on the ROM image to dump out interesting text, e.g.

strings vbios.rom
00IBM VGA Compatible BIOS.

..and then used a tool like bvi to edit the ROM.

Friday, 15 July 2011

Editing binaries using bvi

I don't want to start a vi vs emacs holy war, but I have found a wonderful binary editor based on the vi command set.  The bvi editor is great for tweaking binary files - one can tab between the hex and ASCII representations of the binary data to re-edit the data.

To install use:

sudo apt-get install bvi

As well as basic vi commands, bvi contains bit-level operations to and/or/xor/not/shift bits too.   One has to use ':set memmove' to insert/delete data as this is not enabled by default.

Anyhow, install, consult the man page and get editing binary files!

Wednesday, 6 July 2011

Debugging I/O reads/writes used by the ACPI core using SystemTap

SystemTap is a very useful and powerful tool that enables one to insert kernel debug into a running kernel.  Today I wanted to inspect the I/O read/write operations occurring when running some ACPI AML, so it was a case of hacking up a few lines of system tap to dump out the relevant state (e.g. which port being accessed, width of the I/O operation in bits and value being written or read).

So instead of spinning a bespoke kernel with a few lines of debug in, I use SystemTap to quickly write a re-usable script.  Simple and easy!

I've put the SystemTap script in my git repository for any who are interested.

Wednesday, 29 June 2011

PC Speaker weirdnesses

Today we were trying to do some debugging by getting some tones out of a laptop speaker by frobbing bit 1 of port 0x61 on the keyboard controller.  Rather unexpectedly I got no sound whatsoever out of the speaker, yet I had managed to do so the day before.   So I double checked what had changed since the day before:

1. Was it because I upgraded my kernel?
2. Did I unexpected disabled the speaker when tweaking BIOS settings?
3. Was it something interfering with my port 0x61 bit twiddling?
4. Was the hardware now broken?

As per usual, I first assumed that the most complex parts of the system were to blame as they normally can go wrong in the most subtle way.  After a lot of fiddling around I discovered that the PC speaker only worked when I plugged the AC power into the laptop.  Now that wasn't obvious.

I suspect I should have applied Occam's Razor to this problem to begin with. We live and learn...

Friday, 24 June 2011

dstat - a replacement for vmstat, iostat and ifstat

Normally when I see a problem to do with CPU, I/O, network or memory resource hogging I turn to my trusty tools such as vmstat, iostate or ifstat to check system behaviour.

I stumbled upon dstat today, which boasts to be "a versatile replacement for vmstat, iostat and ifstat.".  So let's see what it can do. First, install with:

sudo apt-get install dstat

Running dstat with no arguments displays CPU stats (user,system,idle,wait,hardware interrupts, software interrupts), Disk I/O stats (read/write), Network stats (receive,send), Paging stats (in/out) and System stats (interrupts, context switches).   Not bad at all. 

Dstat even highlights in colour the active values, so you don't miss relevant deltas in the statistics being churned out by the tool.  Colours can be disabled with the --nocolor option.

But there is more. There are a plethora of options to show various system activities, such as:

-l       load average
-m       memory stats (used, buffers, cache, free)
-p       process stats (runnable, uninterruptible, new)
--aio    aio stats (asynchronous I/O)
--fs     filesystem stats (open files, inodes)
--ipc    ipc stats (message queue, semaphores, shared memory)
--socket socket stats (total, tcp, udp, raw, ip-fragments)
--tcp    TCP stats (listen, established, syn, time_wait, close)
--udp    UDP stats (listen, active)
--vm          virtual memory stats (hard pagefaults, soft pagefaults, allocated, free) name but a few.  As it is, this already is more functional that vmstat, iostat and ifstat.   But that's not all - dstat is extensible by the use of dstat plugins and the packaged version of dstat comes shipped with a few already, for example:

--battery          battery capacity
--cpufreq          CPU frequency
--top-bio-adv  top block I/O activity by process
--top-cpu-adv  most expensive CPU process
--top-io      most expensive I/O process
--top-latency  process with highest latency

..and many more besides - consult the manual page for dstat to see more.

So, dstat really is a Swiss army knife - a useful tool to get to know and use whenever you need to quickly spot misbehaving processes or devices.

Thursday, 23 June 2011

Transitioning a PC to S5 (revisited)

Back in February I wrote about turning off a PC using the Intel I/O Controller Hub and had some example code to do this, and it was a dirty hack.  I've revised this code to be more aligned with how the Linux kernel does this, namely:

1. Cater for the possible existence of PM1b_EVT_BLK and PM1b_CNT_BLK registers.
2. Clear WAKE_STATUS before transitioning to S5
3. Instead of setting SLP_TYP and SLP_EN on, one sets SLP_TYPE and then finally sets SLP using separate writes.

The refined program requires 4 arguments, namely the port address of the PM1a_EVT_BLK, PM1b_EVT_BLK, PM1a_CNT_BLK and PM1b_CNT_BLK.  If the PM1b_* ports are not defined, then these should be 0.

To find these ports run either:

cat /proc/ioports | grep PM1

or do:

sudo fwts acpidump - | grep PM1 | grep Block:

For example, on a Sandybridge laptop, I have PM1a_EVT_BLK = 0x400, PM1a_CNT_BLK = 0x404 and the PM1b_* ports are not defined, so I use:

sudo ./halt 0x400 0 0x404 0

..and this will transition the laptop to the S5 state very quickly. Needless to say, make sure you have sync'd and/or unmounted your filesystems before doing this.

The PM1_* port addresses from /proc/ioport and the fwts acpidump come from the PM1* configuration data from the ACPI FACP, so if this is wrong, then powering down the machine won't work.  So, if you can't shutdown your machine using this example code then it's possible the FACP is wrong. At this point, one should sanity check the port addresses using the appropriate Southbridge data sheet for your machine - generally speaking look for:

PM1_STS—Power Management 1 Status Register (aka PM1a_EVT_BLK)
PM1_CNT—Power Management 1 Control  (aka PM1a_CNT_BLK)

..however these are offsets from PMBASE which are defined in the LPC Interface PCI Register Address Map so you may require a little bit of work to figure out the addresses of these registers on your hardware.

Wednesday, 22 June 2011

Using SystemTap to do runtime AML tracing

The ACPI engine in the kernel can be debugged by building with CONFIG_ACPI_DEBUG and configuring /sys/module/acpi/parameters/debug_layer and /sys/module/acpi/parameters/debug_level appropriately.   This can provide a wealth of data and is generally a very powerful debug state tracing mechanism.  However, there are times when one wants to get a little more debug data out or perhaps just drill down on a specific core area of functionality without being swamped by too much ACPI debug. This is where tools like SystemTap are useful.

SystemTap is a very powerful tool that allows one to add extra debug instrumentation into a running kernel without the hassle and overhead of rebuilding a kernel with debug printk() statements in. It allows very quick turnaround in writing debug and one does not have to reboot a machine to load a new kernel since the debug is loaded and unloaded dynamically.

SystemTap has its own scripting language for writing debug scripts, but for specialised hackery it provides a mechanism ('guru mode') to embed C directly which can be called from the SystemTap script.   The SystemTap language is fairly small and easy to understand and one easily becoming proficient with the language in a day.

The only downside is that one requires a .ddeb kernel package which is huge since it contains all the necessary kernel debug information. 

Over the past week  I have been looking at debugging various aspects of the ACPI core, such as fulling tracing suspend/resume and dumping out executed AML code at run time.   I was able to quickly prototype a SystemTap script that dumps out AML opcodes on the Oneiric kernel - this saved me the usual build of a debug kernel with CONFIG_ACPI_DEBUG enabled and then capturing the appropriate debug and wading through copious amounts of debug data.

Conclusion: Some initial investment in time and effort is required to understand SystemTap (and to get to grips with the more useful features in 'guru mode'). However, one can be far more productive because the debug cycle is made far more efficient. Also, SystemTap provides plenty of functionality to allow very detailed and targeted debugging scripts.

Wednesday, 15 June 2011

Dumping the contents of the Embedded Controller

Dumping the contents of the Embedded Controller (EC) can be useful when debugging some x86 BIOS/kernel related issues.  At a hardware level to get access to the EC memory one goes via the EC command/status and data port.   As a side note, one can determine these ports as follows:

cat /proc/ioports  | grep EC

The preferred way to access these is via the ACPI EC driver in drivers/acpi/ec.c which is used by the ACPI driver to handle read/write operations to the EC memory region.

In addition to this driver, there the ec_sys module that provides a useful debugfs interface to allow one to read + write to the EC memory.  Write support is enabled with the ec_sys module parameter 'write_support' but it is generally discouraged as one may be poking data into memory may break things in an unpredictable manner, hence by default write support is disabled.

So, to dump the contents of the first EC (assuming debugfs is mounted), do:

sudo modprobe ec_sys
sudo od -t x1 /sys/kernel/debug/ec/ec0/io


As a bonus, the General Purpose Event bits are also readable from /sys/kernel/debug/ec/ec0/gpe.

Before I stumbled upon ec_sys.c I used a SystemTap script to execute ec_read() in ec.c to do the reading directly.  Yes it's ugly and stupid, but it does prove SystemTap is a very useful tool.

Thursday, 26 May 2011

GNU EFI lib and Hello World

Writing UEFI applications takes a little bit of effort to get started, fortunately the GNU EFI library provides enough infrastructure to make it far less painful.  The crunch points are:

1. Understanding which UEFI protocols to use, fortunately these are well documented in the UEFI specification. Unfortunately the specification is a couple of thousand pages long, so be prepared to spend some time working out how it all hangs together.

2. Figuring out how to call the protocols using the GNU EFI wrapper shim.

3. Generating a UEFI PE COFF image using some objcopy runes.

So, the best way to start was to write a "Hello World" program and to work up from there.  First, I installed the gnu-efi development library using:

sudo apt-get install gnu-efi

Next I hacked up the classic "Hello World", this can be found here.  Then considerable amount of Googling was required to figure out how to use objcopy to help create the desired executable image.  The Makefile containing these runes can be found here.

So what next?  Well, this little adventure is my first step in a quest to write a bunch of tests so ensure UEFI firmware has functional protocols to allow grub to load a kernel.  I was hoping to re-use the firmware test suite framework but it may take some effort since I don't have any POSIX support on UEFI, so I'm going to have to scope out how much effort is required to port the framework to UEFI.

Anyhow, it's a start.

Monday, 16 May 2011

Dumping UEFI variables

UEFI variables in Linux can be found in /sys/firmware/efi/vars on UEFI firmware based machine, however, the raw variable data is in a binary format and hence not in a human readable form.   The Ubuntu Natty firmware test suite contains the uefidump tool to extract and decode the binary data into a more human readable form.

To run, use:

sudo fwts uefidump -

and you will see something similar to the following:

Name: AuthVarKeyDatabase.
  GUID: aaf32c78-947b-439a-a180-2e144ec37792
  Attr: 0x17 (NonVolatile,BootServ,RunTime).
  Size: 1 bytes of data.
  Data: 0000: 00                                               .

Name: Boot0000.
  GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
  Attr: 0x7 (NonVolatile,BootServ,RunTime).
  Active: Yes
  Info: Primary Master Harddisk
  Path: \BIOS(2,0,Primary Master Harddisk).

Name: Boot0001.
  GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
  Attr: 0x7 (NonVolatile,BootServ,RunTime).
  Active: Yes
  Info: EFI Internal Shell
  Path: \Unknown-MEDIA-DEV-PATH(0x7)\Unknown-MEDIA-DEV-PATH(0x6).

Name: Boot0003.
  GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
  Attr: 0x7 (NonVolatile,BootServ,RunTime).
  Active: Yes
  Info: ubuntu
  Path: \HARDDRIVE(1,22,9897,0f52a6e132775546,ab,f6)\FILE('\EFI\ubuntu\grubx64.efi').

Name: Boot0004.
  GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
  Attr: 0x7 (NonVolatile,BootServ,RunTime).
  Active: Yes
  Path: \ACPI(0xa0341d0,0x0)\PCI(0x2,0x1f)\ATAPI(0x0,0x1,0x0).

Name: BootOptionSupport.
  GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
  Attr: 0x6 (BootServ,RunTime).
  BootOptionSupport: 0x0303.

Name: BootOrder.
  GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
  Attr: 0x7 (NonVolatile,BootServ,RunTime).
  Boot Order: 0x0003,0x0000,0x0001,0x0004,0x0005,0x0006.

Name: ConIn.
  GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
  Attr: 0x7 (NonVolatile,BootServ,RunTime).
  Device Path: \ACPI(0xa0341d0,0x0)\PCI(0x0,0x1f)\ACPI(0x50141d0,0x0)\UART(115200 baud,8,1,1)\VENDOR(11d2f9be-0c9a-9000-273f-c14d7f010400)\USBCLASS(0xffff,0xffff,0x3,0x1,0x1).

Name: ConInDev.
  GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
  Attr: 0x6 (BootServ,RunTime).
  Device Path: \ACPI(0xa0341d0,0x0)\PCI(0x0,0x1f)\ACPI(0x50141d0,0x0)\UART(115200 baud,8,1,1)\VENDOR(11d2f9be-0c9a-9000-273f-c14d7fff0400).

Name: Setup.
  GUID: 038bcef0-21e2-49d1-a47c-b7257296b980
  Attr: 0x7 (NonVolatile,BootServ,RunTime).
  Size: 114 bytes of data.
  Data: 0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Data: 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Data: 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Data: 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Data: 0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Data: 0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Data: 0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Data: 0070: 01 00   

The tool will try to decode the binary data, however, if it cannot identify the variable type it will resort to doing a hex dump of the data instead.

Wednesday, 27 April 2011

How much data has been written to your ext4 partitions?

The ext4 file system has a bunch of per-device /sys entries in /sys/fs/ext4/ that can be used to inspect and change ext4 tuning parameters.   One of the read-only values available is lifetime_write_kbytes which shows the number of kilobytes of data written to the file system since it was created.   

For example, to see how much data was written on an ext4 filesystem on /dev/sda5, one uses:

cat /sys/fs/ext4/sda5/lifetime_write_kbytes

To see how much data in has been written in kilobytes since mount time, read  session_write_kbytes, for example:

cat /sys/fs/ext4/sda5/session_write_kbytes

For a full description of all the tunables, consult Documentation/ABI/testing/sysfs-fs-ext4 in the Linux source.

Tuesday, 26 April 2011

lstopo - display a topological map of a system

The hwloc (hardware locality) package contains the useful tool lstopo. To install use:

sudo apt-get install hwloc

By default, lstopo will display a logical view of the system caches and CPU cores, for example:

To get a non-graphical output use:

lstopo -

Machine (1820MB) + Socket #0 + L3 #0 (3072KB)
  L2 #0 (256KB) + L1 #0 (32KB) + Core #0
    PU #0 (phys=0)
    PU #1 (phys=2)
  L2 #1 (256KB) + L1 #1 (32KB) + Core #1
    PU #2 (phys=1)
    PU #3 (phys=3)

lstopo is also able to output the toplogy image in a variety of formats (Xfig, PDF, Postscript, PNG, SVG and XML) by specifying the output filename and extension, e.g.

lstopo topology.pdf

For more information, consult the manual for hwloc and lstopo.