Canonical Voices

Posts tagged with 'math'

niemeyer

As originally shared on Google+, and as a follow up of the previous post covering OpenGL on Go QML, a new screencast was published to demonstrate the latest features introduced around OpenGL support in Go QML:

Refrences:

Read more
niemeyer

Here is a small programming brain teaser for the weekend:

Assume uf is an unsigned integer with 64 bits that holds the IEEE-754 representation for a binary floating point number of that size.

The questions are:

1. How to tell if uf represents an integer number?

2. How to serialize the absolute value of such an integer number in the minimum number of bytes possible, using big-endian ordering and the 8th bit as a continuation flag? For example, float64(1<<70 + 3<<21) serializes as:

"\x81\x80\x80\x80\x80\x80\x80\x83\x80\x80\x00"

The background for this problem is that the current draft of the strepr specification mentions that serialization. Some languages, such as Python and Ruby, implement transparent arbitrary precision integers, and that makes implementing the specification easier.

For example, here is a simple Python interactive session that arrives at the result provided above exploring the native integer representation.

>>> f = float((1<<70) + (3<<21))
>>> v = int(f)
>>> l = [v&0x7f]
>>> v >>= 7
>>> while v > 0:
...     l.append(0x80 | (v&0x7f))
...     v >>= 7
... 
>>> l.reverse()
>>> "".join("%02x" % i for i in l)
'8180808080808083808000'

Python makes the procedure simpler because it is internally converting the float into an integer of appropriate precision via standard C functions, and then offering bit operations on the resulting value.

The suggested brain teaser can be efficiently solved using just the IEEE-754 representation, though, and it’s relatively easy because the problem is being constrained to the integer space.

A link to an implementation will be provided next week.

UPDATE: The logic is now available as part of the reference implementation of strepr.

Read more
niemeyer

Note: This is a candidate version of the specification. This note will be removed once v1 is closed, and any changes will be described at the end. Please get in touch if you’re implementing it.

Contents


Introduction

This specification defines strepr, a stable representation that enables computing hashes and cryptographic signatures out of a defined set of composite values that is commonly found across a number of languages and applications.

Although the defined representation is a serialization format, it isn’t meant to be used as a traditional one. It may not be seen entirely in memory at once, or written to disk, or sent across the network. Its role is specifically in aiding the generation of hashes and signatures for values that are serialized via other means (JSON, BSON, YAML, HTTP headers or query parameters, configuration files, etc).

The format is designed with the following principles in mind:

Understandable — The representation must be easy to understand to increase the chances of it being implemented correctly.

Portable — The defined logic works properly when the data is being transferred across different platforms and implementations, independently from the choice of protocol and serialization implementation.

Unambiguous — As a natural requirement for producing stable hashes, there is a single way to process any supported value being held in the native form of the host language.

Meaning-oriented — The stable representation holds the meaning of the data being transferred, not its type. For example, the number 7 must be represented in the same way whether it’s being held in a float64 or in an uint16.


Supported values

The following values are supported:

  • nil: the nil/null/none singleton
  • bool: the true and false singletons
  • string: raw sequence of bytes
  • integers: positive, zero, and negative integer numbers
  • floats: IEEE754 binary floating point numbers
  • list: sequence of values
  • map: associative value→value pairs


Representation

nil = 'z'

The nil/null/none singleton is represented by the single byte 'z' (0x7a).

bool = 't' / 'f'

The true and false singletons are represented by the bytes 't' (0x74) and 'f' (0x66), respectively.

unsigned integer = 'p' <value>

Positive and zero integers are represented by the byte 'p' (0x70) followed by the variable-length encoding of the number.

For example, the number 131 is always represented as {0x70, 0x81, 0x03}, independently from the type that holds it in the host language.

negative integer = 'n' <absolute value>

Negative integers are represented by the byte 'n' (0x6e) followed by the variable-length encoding of the absolute value of the number.

For example, the number -131 is always represented as {0x6e, 0x81, 0x03}, independently from the type that holds it in the host language.

string = 's' <num bytes> <bytes>

Strings are represented by the byte 's' (0x73) followed by the variable-length encoding of the number of bytes in the string, followed by the specified number of raw bytes. If the string holds a list of Unicode code points, the raw bytes must contain their UTF-8 encoding.

For example, the string hi is represented as {0x73, 0x02, 'h', 'i'}

Due to the complexity involved in Unicode normalization, it is not required for the implementation of this specification. Consequently, Unicode strings that if normalized would be equal may have different stable representations.

binary float = 'd' <binary64>

32-bit or 64-bit IEEE754 binary floating point numbers that are not holding integers are represented by the byte 'd' (0x64) followed by the big-endian 64-bit IEEE754 binary floating point encoding of the number.

There are two exceptions to that rule:

1. If the floating point value is holding a NaN, it must necessarily be encoded by the following sequence of bytes: {0x64, 0x7f, 0xf8, 0x00 0x00, 0x00, 0x00, 0x00, 0x00}. This ensures all NaN values have a single representation.

2. If the floating point value is holding an integer number it must instead be encoded as an unsigned or negative integer, as appropriate. Floating point values that hold integer numbers are defined as those where floor(v) == v && abs(v) != ∞.

For example, the value 1.1 is represented as {0x64, 0x3f, 0xf1, 0x99, 0x99, 0x99, 0x99, 0x99, 0x9a}, but the value 1.0 is represented as {0x70, 0x01}, and -0.0 is represented as {0x70, 0x00}.

This distinction means all supported numbers have a single representation, independently from the data type used by the host language and serialization format.

list = 'l' <num items> [<item> ...]

Lists of values are represented by the byte 'l' (0x6c), followed by the variable-length encoding of the number of pairs in the list, followed by the stable representation of each item in the list in the original order.

For example, the value [131, -131] is represented as {0x6c, 0x70, 0x81, 0x03, 0x6e, 0x81, 0x03, 0x65}

map = 'm' <num pairs> [<item key> <item value>  ...]

Associative maps of values are represented by the byte 'm' (0x6d) followed by the variable-length encoding of the number of pairs in the map, followed by an ordered sequence of the stable representation of each key and value in the map. The pairs must be sorted so that the stable representation of the keys is in ascending lexicographical order. A map must not have multiple keys with the same representation.

For example, the map {"a": 4, 5: "b"} is always represented as {0x6d, 0x02, 0x70, 0x05, 0x73, 0x01, 'b', 0x73, 0x01, 'a', 0x70, 0x04}.


Variable-length encoding

Integers are variable-length encoded so that they can be represented in short space and with unbounded size. In an encoded number, the last byte holds the 7 least significant bits of the unsigned value, and zero as the eight bit. If there are remaining non-zero bits, the previous byte holds the next 7 bits, and the eight bit is set on to flag the continuation to the next byte. The process continues until there are non-zero bits remaining. The most significant bits end up in the first byte of the encoded value, which must necessarily not be 0x80.

For example, the number 128 is variable-length encoded as {0x81, 0x00}.


Reference implementation

A reference implementation is available, including a test suite which should be considered when implementing the specification.


Changes

draft1 → draft2

  • Enforce the use of UTF-8 for Unicode strings and explain why normalization is being left out.
  • Enforce a single NaN representation for floats.
  • Explain that map key uniqueness refers to the representation.
  • Don’t claim the specification is easy to implement; floats require attention.
  • Mention reference implementation.

Read more
niemeyer

Our son Otávio was born recently. Right in the first few days, we decided to keep tight control on the feeding times for a while, as it is an intense routine pretty unlike anything else, and obviously critical for the health of the baby. I imagined that it wouldn’t be hard to find an Android app that would do that in a reasonable way, and indeed there are quite a few. We went with Baby Care, as it has a polished interface and more features than we’ll ever use. The app also includes some basic statistics, but not enough for our needs. Luckily, though, it is able to export the data as a CSV file, and post-processing that file with the R language is easy, and allows extracting some fun facts about what the routine of a healthy baby can look like in the first month, as shown below.

Otávio

The first thing to do is to import the raw data from the CSV file. It is a one-liner in R:

> info = read.csv("baby-care.csv", header=TRUE)

Then, this file actually comes with other events that won’t be processed now, so we’ll slice it and grab only the rows and columns of interest:

> feeding <- info[info$Event.type == "Breast",
        c("Event.subType", "Start.Time", "End.Time", "Duration")]

This is how it looks like:

> feeding[100:103,]
    Event.subType       Start.Time         End.Time Duration
129          Left 2013/01/04 13:45 2013/01/04 14:01    00:16
132          Left 2013/01/04 16:21 2013/01/04 16:30    00:09
134         Right 2013/01/04 17:46 2013/01/04 17:54    00:08

Now things get more interesting. Let’s extract that duration column into a more useful vector, and do some basic analysis:

> duration <- as.difftime(as.vector(feeding$Duration), "%H:%M")

> length(duration)
[1] 365

> total = sum(duration)
> units(total) = "hours"
> total
Time difference of 63.71667 hours

> mean(duration)
Time difference of 10.47397 mins
> sd(duration)
[1] 5.937172

A total of 63 hours surprised me, but the mean time of around 10 minutes per feeding is within the recommendation, and the standard deviation looks reasonable. It may be more conveniently pictured as a histogram:

> hist(as.numeric(duration), breaks="FD",
    col="blue", main="", xlab="Minutes")

Duration histogram

Another point we were interested on is if both sides are properly balanced:

> sides <- c("  Right", "  Left")
> tapply(duration, feeding$Event.subType, mean)[sides]
   Right     Left 
10.72283 10.22099

Looks good.

All of the analysis so far goes over the whole period, but how has the daily intake changed over time? We’ll need an additional vector to compute this and visualize in a chart:

> day <- format(strptime(feeding$Start.Time, "%Y/%m/%d %H:%M"),
                "%Y/%m/%d")
> perday <- tapply(duration, day, sum)
> mean(perday)
[1] 136.5357
> sd(perday)
[1] 53.72735
> sd(perday[8:length(perday)])
[1] 17.49735

> plot(perday, type="h", col="blue", xlab="Day", ylab="Minutes")

Daily duration

The mean looks good, with about two hours every day. The standard deviation looks high on a first look, but it’s actually not that bad if we take off the first few days. Looking at the graph shows why: the slope on the left-hand side, which is expected as there’s less milk and the baby has more trouble right after birth.

The chart shows a red flag, though: one day seems well below the mean. This is something to be careful about, as babies can get into a loop where they sleep too much and miss being hungry, the lack of feeding causes hypoglycemia, which causes more sleep, and it doesn’t end up well. A rule of thumb is to wake the baby up every two hours in the first few days, and at most every four hours once he stabilizes for the following weeks.

So, this was another point of interest: what are the intervals between feedings?

> start = strptime(feeding$Start.Time, "%Y/%m/%d %H:%M")
> end = strptime(feeding$End.Time, "%Y/%m/%d %H:%M")
> interval <- start[-1]-end[-length(end)]

> hist(as.numeric(interval), breaks="FD", col="blue",
       main="", xlab="Minutes")

Interval histogram

Seems great, with most feedings well under two hours. There's a worrying outlier, though, of more than 6 hours. Unsurprisingly, it happened over night:

> feeding$End.Time[interval > 300]
[1] 2013/01/07 00:52

It wasn't a significant issue, but we don't want that happening often while his body isn't yet ready to hold enough energy for a full night of sleep. That's the kind of reason we've been monitoring him, and is important because our bodies are eager to get full nights of sleep, which opens the door for unintended slack. As a reward for that kind of control, we've got the chance to enjoy not only his health, but also an admirable mood.

Love, Dad.

Read more
Gustavo Niemeyer

Wiki + Spreadsheet

The underlying concept is very simple: spreadsheets are a way to organize text, numbers and formulas into what might be seen as a natively numeric environment: a matrix. So what would happen if we loosed some of the bolts of the numeric-oriented organization, and tried to reuse the same concepts into a more formatting-oriented environment which is naturally collaborative: a wiki.

While I do encourage you to answer this with some fantastic new online service (please provide me with an account and the best e-book reader device available once you’re rich) I had a try at answering this question myself a while ago by writing the Calc macro for Moin.

Basically, the Calc macro allows extracting values found in a wiki page into lists (think columns or rows), and applying formulas and further formatting as wanted.

I believe there’s a lot of potential on the basic concept, and the prototype, even though functional and useful, surely has a lot to evolve, so I’ve published the project in Launchpad to make contributions easier. I actually apologize for not publishing it earlier. There was hope that more features would be implemented before releasing, but now it’s clear that it won’t get many improvements from me anytime soon. If you do decide to improve it, please try to prepare patches which are mostly ready for integration, including full testing, since I can’t dedicate much time for it myself in the foreseeable future.

Read more