Tuesday, November 18, 2014

Papa Parse 4, the fastest CSV parser for the browser

Papa Parse 4 is here.

This version sports a complete rewrite of the core parser, making it about 10 times faster in general, and up to 270 times faster in special cases. The speed multiplier is inversely correlated with input size, but the new algorithm is still faster all around.

It's the fastest for its feature set of all the parsers I compared. I've run hundreds of benchmarks locally, but you can try running some benchmarks yourself on jsPerf.

I rewrote Papa Parse about 6 times and tried roughly a dozen different ways to parse CSV. I won't spend time going into detail about each algorithm, but here's a summary of how the old parser worked, then a few ways I tried that either didn't work, or worked but were too slow and complex.

Various (slow) ways of parsing CSV

The old parser iterated on each character. Though character iteration allowed us to keep a very granular amount of state as we went, it was slow, plain and simple.

So I figured that splitting on newlines and delimiters may not be a bad idea after all. Sure, quoted fields and commented lines (not part of the standard) throw things off, but we could iterate through and "repair" the broken fields if we found any. Unfortunately, the nested loop led to too many iterations, and moving things around within arrays when making repairs was too slow.

What about doing the same thing, but starting with an empty array and appending each field, one at a time, then each row to the array, instead of splicing existing arrays? Still a nested loop; still too slow.
I tried reversing the split order: first split on delimiters, then on newlines. I didn't expect much, and sure enough, it wasn't faster.

Then I tried splitting on delimiters, and using substring to detect newlines and append rows to the final result set. This was actually working pretty well, until... quoted fields, especially quoted fields that end a line. Things got complicated quickly.

Then the opposite occurred to me: split on newlines, then use substring to extract fields between delimiters. Still a big fat nested for loop. Performance tanked before I could even get the algorithm correct for all the edge cases.

Then a novel idea: split on quotes? Hey, that's neat: no more looking over my shoulder for those pesky quoted fields, or skipping opening and closing quotes. I could just use indexOf() and substring() between delimiters and newlines without any trouble. It was a weird way of thinking, since an element in the array that ends with a delimiter or newline indicates a quoted field just began. Once inside a quoted field, if I encountered two consecutive empty strings, I knew the data had a single quote character. Strange, but plausible. The more quotes were in the data, the slower it performed. I couldn't win! And what if the data didn't have quotes? Then it was just plain substring extraction between newlines and delimiters anyway. Wait... that could work...

To heck with split(). Even though it's a super-fast JavaScript native, the repairs were too hard. The inner loops were too slow. But that last way of splitting on quotes showed me that using substring() was pretty easy and still fast enough.

This turned out to be the fastest, simplest algorithm yet. It uses a single tight loop without any nested loops except for -- you guessed it -- handling quoted fields. There is no splicing of array elements and no string splitting.

The new algorithm

The base algorithm is fairly simple, except for when you hit a quoted field:
  1. Start with cursor = 0, row = [], results = [], nextDelim = pos of first delimiter, and nextNewline = pos of first line break
  2. loop:
    1. if input[cursor] == quote:
      1. find the closing quote using indexOf (tricky)
      2. push input.substring(cursor+1, closingQuotePos) into row, replacing "" with "
      3. advance cursor and handle end of field, line, or file as needed
      4. nextDelim = index of next delimiter, and nextNewline = index of next line break;
    2. else if nextDelim != EOF && nextDelim < nextNewline:
      1. push input.substring(cursor, nextDelim) into row
      2. advance cursor
      3. nextDelim = index of next delimiter;
    3. else if nextNewline != EOF:
      1. push input.substring(cursor, nextNewline) into row
      2. push row into results, then clear row
      3. advance cursor
      4. nextNewline = index of next newline;
    4. else: break;
  3. push input.substring(cursor, EOF) into row
  4. push row into results
  5. advance cursor
  6. return results;
Basically you just hop along between delimiters and newlines, whichever comes first, and copy substrings into the results array. Quoted fields are an exception, but they are not as complex as you'd think. (Papa Parse also handles commented lines and a few other features but the performance impact is negligible.)
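
The steps above can be sketched in JavaScript. This is a simplified illustration only: it hardcodes the delimiter and newline and skips quoted fields, comments, and the other features the real parser handles.

```javascript
// Simplified sketch of the hop-between-delimiters loop.
// Quoted fields, comments, and configurable delimiters are omitted.
function parseSimple(input) {
  var delim = ",", newline = "\n";
  var cursor = 0, row = [], results = [];
  var nextDelim = input.indexOf(delim);
  var nextNewline = input.indexOf(newline);

  while (true) {
    if (nextDelim !== -1 && (nextNewline === -1 || nextDelim < nextNewline)) {
      row.push(input.substring(cursor, nextDelim));   // end of field
      cursor = nextDelim + 1;
      nextDelim = input.indexOf(delim, cursor);
    } else if (nextNewline !== -1) {
      row.push(input.substring(cursor, nextNewline)); // end of row
      results.push(row);
      row = [];
      cursor = nextNewline + 1;
      nextNewline = input.indexOf(newline, cursor);
    } else {
      break;                                          // end of input
    }
  }

  row.push(input.substring(cursor));                  // last field
  results.push(row);
  return results;
}
```

Notice there is no inner loop and no splicing: the cursor only ever moves forward, copying substrings as it goes.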

The hard logic is finding the correct closing quote, because you could theoretically have a field that looks like """,""" (a delimiter with escaped quotes on either side, in a quoted field). In context, it might look like ...",""",""","... and so to make sure we do it right, there's a few fun if statements.
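
That quote-matching step might look something like this. An illustrative sketch only; the real parser also has to check what follows the closing quote (delimiter, newline, or end of input).

```javascript
// Hypothetical sketch: find the quote that actually closes a quoted field,
// skipping over escaped quotes ("" pairs) along the way.
function findClosingQuote(input, start) {
  var i = start; // position just after the opening quote
  for (;;) {
    var q = input.indexOf('"', i);
    if (q === -1) return -1;                           // unclosed quoted field
    if (input[q + 1] === '"') { i = q + 2; continue; } // escaped "", keep going
    return q;                                          // a lone quote closes it
  }
}
```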

As for speed, I think the slowest part, other than pushing the data, is replacing escaped quotes "" with a single quote " character.

As far as I can tell, the basic algorithm is correct (over 80 tests prove it, with more on the way). Feel free to open a pull request if you'd like to improve it for either speed or correctness! Just make sure the tests pass and the benchmarks are good.

Benchmarks are helpful, but be wary

Now's a good time to remind you that benchmarks are not scripture and should be taken with some skepticism, since most benchmarks demand more elaboration. For example, CSV parsers behave very differently on different types of input. Input with quoted fields will be slower, and input with wide rows performs differently than input with short rows. Some parsers are faster with small input, but a parser that's slower on small input may slow down more slowly as input grows (which is usually more important).

From what I can see, Papa Parse slows down the slowest.

Fast Mode

Papa Parse 4 has a new "fastMode" setting which, when enabled, can parse the input even faster. By enabling fast mode, however, you assert that the input does not contain quoted fields. All it's doing is String.split(), but the nice thing is that all the other Papa Parse features still work. The only requirement is that the input doesn't have quoted fields. If it does, the quotes will not be stripped, and the results may be incorrect if any field contains delimiters, newlines, or escaped quotes. You can see the difference in these benchmarks.
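
Conceptually, fast mode reduces to something like this (an illustrative sketch, not Papa's actual source):

```javascript
// Roughly what fast mode amounts to: no quote handling at all,
// just two passes of String.split().
function fastParse(input, delimiter, newline) {
  return input.split(newline).map(function (line) {
    return line.split(delimiter);
  });
}
```

You can see why a quoted field containing a delimiter or newline would break it: split() has no idea the quotes are there.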

The 1 GB Test

In my testing, Papa Parse can now parse a ~1 GB tab-delimited file in ~45 seconds in Chrome, and ~30 seconds in Firefox, a vast improvement over the previous 3 minutes in Chrome and 75 seconds in Firefox. And with fast mode, the same file took only about 19.48 seconds in Firefox.

Change log

Without further ado, Papa Parse 4 is a new major version. Along with several bug fixes, note these changes:
  • Handling line breaks is no longer the job of the core parser. The line break sequence (\n, \r, or \r\n) must be specified by the caller or it will be auto-detected by Papa.parse() before parsing actually begins.
  • No more ParseAbort or UnexpectedQuotes errors. It's expected that if you, the user, abort the process, you know about it, and don't want to be bothered handling another error that tells you what you already know. Also, UnexpectedQuotes is more like a warning, as it does not interfere with parsing (except slowing it down). As such, both these errors are no longer generated.
  • New fastMode setting. Enable it for way faster parsing, but only if the input doesn't have quoted fields.
  • MissingQuotes used to have the line number as a reference point. Counting line breaks slows down parsing considerably because line breaks may occur in quoted fields, so this is no longer available. However, the row index and character position within the input are still there.
  • The keepEmptyRows setting is replaced with skipEmptyLines. The default value is false, so the default behavior is that empty lines are retained in the output, which is different from previous versions. This is important for single-column CSV files that have empty fields. This also means that fancy behavior like stripping empty rows now has to be explicitly turned on, whereas before Papa was lenient with empty lines and whitespace and would take them out for you without you asking. (This was usually bad!)
  • Lines with just white space are no longer considered empty and the white space will be included in the output.
  • Comments can now start with a certain string of arbitrary length, rather than a single character. For example, comment lines can start with "//" or "=N(" instead of just "#".
  • The chunk callback is now passed three arguments: results, streamer, and if streaming a local file, the File object. The streamer has pause and resume functions exposed so you can pause parsing when receiving chunked results. Before, you could only pause parsing with the step callback.
Anyway, hope you like it. Feel free to give a shout if you're using Papa Parse, I'd love to know! (You can also donate if you want to; it's nice to be able to get a free lunch once in a while.)

Monday, September 1, 2014

Papa Parse 3.1 released

Papa Parse 3.1 is now available. It has some edge-case bug fixes and also a few new features and enhancements.

Official release here on GitHub

Change Log

  • New config property: chunk. Specifying a chunk callback function still streams the input under the hood, but instead of getting the results row-by-row, you get the results a chunk at a time. This can still use more memory than specifying a step callback, but for very large files (say a million rows), invoking a function 10,000 times may be faster than invoking a function 1,000,000 times. Remember the chunk size is customizable.

  • New config property: keepEmptyRows. Sometimes it's important to maintain row parity with the input, for example, when you want some output to match row-for-row with the input. Otherwise, empty rows are skipped, and it can be impossible to know where the empty lines were.

  • New config property: error. Actually, this appeared in the source code already but wasn't documented nor was it used where it should have been. The error callback gets executed if FileReader encounters a problem opening the file.

  • New feature: Pause and resume parsing. This only works without a worker thread. The step callback is passed a ParserHandle which has pause and resume functions exposed. This makes the parse function asynchronous if it wasn't already, but allows you to do other asynchronous things during parsing, for example, AJAX requests.

  • Enhancement: Meta has a new property called truncated. This value will be true if using preview and the preview did not parse all the results because they were longer than the preview length. It will be false if the preview was able to parse the entire input.

  • Fix: The preview value included the header row; in other words, a preview of 10 would produce 9 rows if the first row was a header row. Now, a preview of 10 parses 11 rows so that 10 are data rows.

  • New option: Papa.DefaultDelimiter. The default is comma, but specify a different (and valid) delimiter in the cases when a delimiter is not specified and detection fails.

  • Firefox fix: FileReader in worker thread. Firefox does not (currently) have FileReader in worker contexts. But it looks like they've considered that a bug and will be rolling it out in future versions of Firefox.

  • Fix: When converting JSON to CSV, null value is interpreted as empty string, whereas before an exception was thrown.

  • Fix: Using complete callback with a worker thread.

  • Fix: Escaped quotes in quoted field edge case.

Friday, August 29, 2014

Chrome rendering quirk when using 3D transforms

In Chrome 36.0.1985.143 I encountered a fascinating rendering quirk related to some elements that had a 3D transform and perspective. Firefox didn't exhibit the behavior. Baffled at first, I finally figured out what Chrome is doing.

Check out this layout where I've reproduced the problem. Try clicking the yellow box:

Notice that, even though the blue content box appears on top of the gray rectangle, you can only click on the yellow box over the word "me" where the gray box has ended.

Firefox lets you click the whole box as expected; but in Chrome, only a part.

I tried shifting the gray box by changing the left CSS property; wherever the gray box was, I couldn't click... until the gray box was shifted far over to the right:

Odd. Then I found a shift where I could only click half of it again (over the word "Click"):

Notice that you can only click the other side of the yellow box now, even though the gray box is entirely "behind" the yellow one.

So what's happening?

Chrome is rendering the z-index properly, but z-index does not apply to 3D transforms. The gray box was rotated on the Y axis so that, on the left, it's further away from the user on the Z axis, but on the right, it is closer to the user -- so close, in fact, that it is cutting through the yellow box, even though it does not appear this way visually.

This could be argued as an edge case, but it is a very real concern for web developers using 3D transforms in their layouts. The work-around is to use another 3D transform, translateZ, instead:
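
The fix amounts to a rule like this on the content element (the #content selector and 100px value match the fiddle discussed below; tune the distance to your layout):

```css
#content {
    /* Pull the content toward the viewer on the Z axis so it sits
       in front of the rotated gray element's near edge. */
    transform: translateZ(100px);
}
```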

By translating the content on the Z axis toward the user enough so that it appears in front of even the gray element's right side, the entire content area is selectable and clickable again.

To see this even more clearly, try changing the translateZ value on #content to 90px instead of 100px in that last fiddle, and you'll see that only a small part of the yellow box is not clickable. That part is still considered as "behind" the gray box in Chrome's surface map.

I hope that made sense. This had me scratching my head for a few minutes. Firefox exhibited the behavior I expected, but Chrome was technically more correct in rendering the surface map in a 3D space. Now I'm not sure what to think.

Tuesday, August 5, 2014

Maximizing the Use of Interfaces in Go

Nerd alert: I made a type that satisfies almost every Go standard library interface, just 'cause. And I noticed some interesting things about the standard lib along the way...

But in case you're in a hurry, I'll cut to the chase: Go programmers, as a community, should seek to minimize the declarations of new interfaces and improve on and maximize the use of existing ones.

Favorite thing about Go: can you guess?

Interfaces are my favorite thing about Go. (I share Russ Cox's sentiment for them.) Concurrency is cool too, but I am of the opinion that implicitly-implemented-interfaces trump even the mighty go keyword. (Seems that Dave Cheney thinks so too.)

To illustrate why, consider the tens of thousands of functions in Go code across the world that use an interface value of type error in their signatures. All the functions that use error are automatically compatible with your brand-new type, one the world has never even seen before, simply because you added an Error() string method to it.

Isn't that cool!? Two packages can work with each other without having to know anything about each other: the types just happen to have methods defined on them which implement a common interface. This has huge implications for how we, as a community, design software.
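
For example, here's a sketch of handing a new type over to the error ecosystem (MyError and doWork are made-up names):

```go
package main

import "fmt"

// MyError is a made-up type; the Error() string method is all it takes
// to satisfy the standard error interface.
type MyError struct {
	Code int
}

func (e MyError) Error() string {
	return fmt.Sprintf("failed with code %d", e.Code)
}

// doWork stands in for any function declared to return an error.
func doWork() error {
	return MyError{Code: 418}
}

func main() {
	if err := doWork(); err != nil {
		fmt.Println(err) // prints: failed with code 418
	}
}
```

Every function that accepts or returns an error, in any package anywhere, can now carry a MyError without ever importing it.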

It gets better. When declaring your own interface, you can take an existing interface and embed it into yours, like io.ReadWriter does:

type ReadWriter interface {
        Reader
        Writer
}

Now any types that implement io.ReadWriter also implement io.Reader and io.Writer automatically. It's a 3 for 1 deal. Wow!

My interface addiction

A while ago, I self-diagnosed myself with Compulsive Interface Implementation Disorder (CIID). I was writing custom error types, first for the martini-contrib/binding package, then again for the more general mholt/binding. I satisfied my craving by satisfying the error interface. Then suddenly I realized what I'd done: I'd just given new life to thousands of existing libraries. All the code which requires an error could now use my new type.

Of course this isn't the first time I've used interfaces in Go before, and most likely I've been under-utilizing them. There was something so liberating about making this new type compatible with nearly everything else out there.

A look at interfaces in the standard library

As addicts do, I sought more of a high. So I wrote a little program that finds all the interfaces from the Go 1.3 standard library. Turns out there are 114 standard library interfaces.

Here's a few things I noticed while glossing over them.

There's plenty of overlap.

I asked the folks on Gophers Slack why several interface declarations contained duplicate methods: same names, same signatures. I found the following interfaces (there are probably others) with duplicate method definitions:

  • Close() error
  • ReadRune() (r rune, size int, err error)
  • UnreadRune() error
  • ReadByte() (c byte, err error)
  • Write(p []byte) (n int, err error)
  • Seek(offset int64, whence int) (int64, error)
  • String() string

I am curious if this duplication is intentional, because each of these methods need only be declared once. The other interfaces could then just embed that interface where the desired method is declared, like we saw with io.ReadWriter. (Note that by "duplication" I mean the method name & signature, not its actual implementation.)

Some participants in Gophers Slack suggested that the duplication was for stability: if an embedded interface was changed, it would probably break code that used the embedder interface. Consider this frailty when designing your own packages.

In a comment, Jan Mercl pointed out that embedding a type (say fmt.Stringer) requires that package to import fmt, even if it doesn't use it for anything else. This should be taken into consideration when embedding.

Also, using embedding implies a working relationship with the embedded type. Simply sharing the method signature is not necessarily enough reason to embed. Even if an interface you made for your own http package happens to have the same methods as another interface in some obscure package like syscall doesn't mean you should import syscall and embed the type.

Most standard lib interfaces are small.

I seldom saw interface declarations in the standard lib with more than 3 or 4 methods. This pattern is beneficial to us.

When two interfaces share a common method, it might be wise to pull that method declaration out into its own interface. Small interfaces are easy to implement and embed to make larger interfaces when needed.
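
A sketch of that pattern (Closer mirrors io.Closer; the other names are hypothetical):

```go
package main

import "fmt"

// Pull the shared method out into its own small interface...
type Closer interface {
	Close() error
}

// ...then embed it rather than redeclaring Close() error.
type ReadCloser interface {
	Read(p []byte) (n int, err error)
	Closer
}

// file is a hypothetical type satisfying both methods.
type file struct{}

func (file) Read(p []byte) (int, error) { return 0, nil }
func (file) Close() error               { return nil }

func main() {
	var rc ReadCloser = file{}
	fmt.Println(rc.Close() == nil)
}
```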

Effective Go suggests that small interfaces should be common.

reflect.Type is a HUGE interface.

Speaking of keeping interfaces small, did you know that the interface reflect.Type has 28 exported methods? I suppose these are for implementing run-time reflection. Large interfaces are exceptionally time-consuming to implement, so avoid them when possible. If you do declare a large interface, take advantage of embedding when you can.

Some interfaces look like others.

The expvar.Var interface acts just like the fmt.Stringer interface because they both have exactly one method and its signature is the same: String() string. It might seem annoying that there are two different interfaces with the same definition.

Which one should you implement? Actually, you don't have to choose. In Go, implementing one means you have implemented the other, since they are the same by definition.
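
For example (greeting is a made-up type):

```go
package main

import (
	"expvar"
	"fmt"
)

// greeting has exactly one method, String() string, so it satisfies
// both fmt.Stringer and expvar.Var without any extra work.
type greeting string

func (g greeting) String() string { return string(g) }

func main() {
	var s fmt.Stringer = greeting("hello")
	var v expvar.Var = greeting("hello")
	fmt.Println(s.String() == v.String())
}
```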

Even though these two interfaces seem redundant, they document code differently. Using expvar.Var makes it clear that you're working with an exported variable, not just some generic type that can stringify itself. In other words, this "redundant" interface is self-documenting, and it provides extra cognitive context. You might furrow a brow if you saw a function use fmt.Stringer in the expvar package in a place where it seems more natural to use a Var.

If you follow this pattern in your own projects, be sure it adds clarity to what your code is doing, and use godoc to help explain its purpose.

Some interfaces almost look like others.

In Go, it's so easy to implement interfaces that it can be accidental.

os.Signal caught my attention because of the comment in its definition:

type Signal interface {
        String() string
        Signal() // to distinguish from other Stringers
}

It looks like fmt.Stringer, except for that pesky Signal() method. This pattern is actually similar to the one above with expvar.Var, but is slightly more rigorous.

What is the purpose of Signal(), which in practice does nothing? I did a little digging in the os and syscall packages. Lots of low-level stuff happens in there. For your type to work with the functions that use an os.Signal value, you need to deliberately assert that your type is, in fact, an os.Signal, not a mere fmt.Stringer.

It's a clever trick. Use it sparingly though, as it's seldom necessary to prevent others from accidentally implementing your interface.

Interface implementations are documented as such.

Nothing fancy here: just write a comment when you deliberately implement an interface, like the Go authors did with syscall.Signal. This makes it obvious where and how you can use the type.

Just for fun: a supermassive type

I implemented as many standard lib interfaces as I could on a single type. The Everything type implements 103 of the 114 Go standard library interfaces. You can use it in ~90% of the standard library functions that have an interface in the signature.

Interface power

Go programmers should use interfaces often. Design for them, and define them carefully.

We should be talking about specific interfaces as much as we talk about packages and projects. Announce interfaces that may be useful to others.

I hope that it becomes common for a list of interfaces implemented by a package to be promoted to the project's README. Implementing popular interfaces should be described as a feature!

As we find and use existing, exported interfaces, our libraries will become more interoperative and compatible at an exponential rate. (Incidentally, a way to search exported Go interfaces would be awesome. SourceGraph? Anyone?)

Go programmers, as a community, should try to minimize the declarations of new interfaces and maximize the use of existing ones.

Monday, August 4, 2014

Convert video to animated gif with ffmpeg

Thanks to the help in this SuperUser question, I'm posting some commands for future reference that make it easy to convert video or movie files to animated GIF.

This requires ffmpeg (and image-magick, if you want better quality). If you have a Mac and need those programs, you can do brew install ffmpeg imagemagick to install them.

NOTE: To convert videos to a more HTML5-compatible format like WebM, install ffmpeg with the --with-libvpx --with-libvorbis --with-fdk-aacc options.

Basic conversion:

ffmpeg -i input.mp4 -r 20 output.gif

-r 20 means 20 frames per second, which is pretty common; 20 FPS looks smooth to the human eye.

Better quality:

ffmpeg -i input.mp4 -r 20 -f image2pipe -vcodec ppm - | convert -delay 5 - output.gif

-f image2pipe tells ffmpeg to output the images through the stdout pipe. -vcodec ppm indicates that the codec used should be for the PPM format (a kind of plain-text representation of images using RGB color values). The -delay 5 means that convert should insert a new frame every 5 ticks. There are 100 ticks in a second, so this equates to 20 FPS. By changing the ratio between the -r and -delay values, you can make your gif faster or slower.

Better quality at smaller file size:

ffmpeg -i input.mp4 -r 20 -f image2pipe -vcodec ppm - | convert -delay 5 - gif:- | convert -layers Optimize - output.gif

Here we run it through convert twice to optimize the different layers. This should reduce the file size a little bit.

But after all this, I discovered that just encoding the original videos using Handbrake with "Web Optimized" checked and moving up the RF slider to a higher value (like 22) was significantly better. A nice-looking gif was about 10 MB but a nice-looking, re-encoded video from Handbrake was only about 200 KB. Go figure.

Bonus tips: resizing images

Image Magick can also resize images on-the-fly and en masse.

One image:

convert input.png -resize 400x400 resized.png

Many images:

mogrify -resize 400x400 *.png

Note that, by default, convert and mogrify will resize the image to fit into those dimensions, keeping the aspect ratio. They will not distort the image. In other words, the -resize value is the maximum dimensions for either side (WxH).

Also, mogrify will overwrite original files. However, the latest versions generate backup files with a tilde (~) at the end of the file name. If you want to prevent this, specify a sub-directory to put the results into:

mogrify -path resized_images -resize 400x400 *.png

(I noticed a more pixellated quality on my Retina MBP when using the -path option. I don't know why. Also, the path must exist beforehand.)

Yay for not doing this all manually!

Extra bonus: convert MP4 to WebM

Thanks to this guide, it's pretty easy to convert MP4 to WebM:

ffmpeg -i input.mp4 -c:v libvpx -crf 40 -b:v 1M -c:a libvorbis output.webm

The CRF value can be anywhere from 4-63 where a lower value is higher quality. They recommend to start at 10, but if your goal is small file size, try a value around 40 and go from there. I found 40 to produce a small video that was still good enough for the web.

Thursday, July 31, 2014

Javascript Performance: Synchronous vs. Asynchronous FileReader

I learned something surprising about Javascript that didn't intuitively transfer over from other languages.

HTML5's FileReader API lets you read files from disk. The user must choose the file(s), but then you have permission to load them in text or binary format. If you've used it before, you know that it's asynchronous, kind of like XMLHttpRequest: you set up a FileReader with callback functions before you perform a read:

var reader = new FileReader();
reader.onload = function(event) {
    // ... event.target.result has file contents ...
};
reader.readAsText(file);
I've heard it said that asynchronous operations in Javascript run slower than synchronous ones. I assume this has to do with processor scheduling, the overhead of a function call, etc. Asynchronous can be a little cumbersome to use, too, and it can lend itself to spaghetti code.

But did you know that there is a synchronous FileReader? No callbacks necessary:
var contents = new FileReaderSync().readAsText(file);

Nice, huh! There's one catch, though: it's only available in web workers. This is because reading a file synchronously is a blocking operation and would lock up the web page. Therefore, you can only use FileReaderSync in a separate thread.

When I rewrote Papa Parse for version 3.0 which would support running in a web worker, I was excited to use this new FileReaderSync thing. One cool feature about Papa is that it can "stream" large files by loading them in pieces so that they fit into memory. Both FileReader APIs support this "chunking" mechanism, so I took advantage of it.

Finally, it was ready to test. I executed it and roughly measured its performance. My first test was in the main thread using FileReader (asynchronous):

That looked right. It took about 11 seconds to process just under five million rows. I was happy with the results.

Then I tried it in the worker thread using FileReaderSync. Like its asynchronous sister code, this routine was periodically sending updates to the console so I could keep an eye on it. Rather abruptly, updates stopped coming. I had to resort to system-level monitors to ensure my thread hadn't locked up or died. It was still there, but its CPU usage dropped significantly and updates slowed to a crawl:

It took nearly 20x longer to process the file in a dedicated worker thread using a synchronous file reader! What!?

The jagged line was expected: the results had to be sent to the main thread after every chunk for reporting. (In Javascript, data has to be copied between threads, rather than sharing memory. D'oh.)

Isn't it odd that its speed sharply, steadily declines at exactly 30 seconds?

Here are the two graphs on the same plane:
Obviously, this was unacceptable. What was going on?

I asked the folks on Stack Overflow and Google+, and guess where I got the answer? Google+, believe it or not. Tyler Ault suggested trying the regular, asynchronous FileReader in the worker thread.

I did and it worked.

Performance in the worker then was comparable to the main thread. No slowdowns. Why?

I don't know for sure. A couple theories were discussed:

  • The garbage collector only runs when the CPU is idle on that Javascript thread. Since it was all synchronous, there wasn't any time to breathe and clean things up. Memory had to be shifted around a lot and maybe even swapped out (though I never exhausted my system memory; maybe the browser put a cap on it).
  • The fact that it slowed at exactly 30 seconds is interesting. It's possible the browser was doing some throttling of threads that ran for that long without a pause. (Though that seems to defeat the purpose of using worker threads.)
If you have more light to share on the topic, please feel free to comment (and +mention me to be sure I get it) or tell me on Twitter @mholt6 or something.

Monday, July 28, 2014

GoConvey for Gophers

GoConvey is a testing utility for Go built around the standard testing package and go tools. If you're not familiar with GoConvey yet, take a look, maybe try it out, or watch the intro video before finishing this article.

Mike (@mdwhatcott) and I (@mholt6) made GoConvey to improve our workflow and to help us write clearer and more deliberate tests.

I wanted to point out a few things that may not be clear or obvious about GoConvey. Then I'll discuss why it is or isn't a good choice for you and your project.

"GoConvey" means a few things.

In your editor, it's a testing framework (namely the Convey and So functions).

In the terminal, goconvey is an HTTP server that auto-runs tests.

At http://localhost:8080, it's a graphical interface for using go test and viewing results.

In your compiled binaries, it's nothing. (More on this later.)

GoConvey is loosely coupled.

That means you can use the DSL without the goconvey server. And vice-versa.

Many developers use the web UI to auto-run and visualize standard Go tests that don't use GoConvey's special functions. GoConvey can run tests built with other testing frameworks as long as they also use go test.

By using GoConvey, you're not automatically building another barrier to entry: you're automatically building and running Go tests.

On the flip-side, it's totally normal to just keep running go test from the command line manually while using GoConvey's syntax to structure tests. (If that's what you prefer.)

GoConvey runs go test.

Look! Nothing up my sleeve. All goconvey really does is run go test. Yeah, it uses the -cover flag and stuff, but all the output and data it generates comes from the go command.

go test runs GoConvey.

GoConvey tests are in _test.go files and inside Test functions with the testing.T type, just like regular tests.

It's a good idea to use regular tests when you don't want the BDD structure or when it becomes unnatural to describe your tests in the BDD style. You can use both kinds of tests in the same package and file.

GoConvey is not a CI tool any more than go test is.

Sometimes I see people refer to GoConvey as a CI (continuous integration) utility. Maybe they define that differently than I do, but the only thing "continuous" about GoConvey is the auto-test feature and the only "integration" features it has are with go test (and maybe with the file system, if you count polling the project directory for changes).

If it works for you in a CI environment, great! (Would you tell me how?) You can write tests using the DSL and run those in CI jobs, but beyond that, the web UI is interactive and I don't think the goconvey server that powers it is useful in automated scripts...

Test dependencies are not compiled into your programs.

Except when you run go test, your Go programs are not built with test files and dependencies. This means that using a testing framework like GoConvey has no technical effect on your compiled program. The file size and function of your resulting binary are unchanged.

Even go-getting any package does not download its test dependencies. You need the -t flag for that.

Test dependencies are thus very deliberately obtained and are only needed by your library's contributors/hackers. Most people will never need to concern themselves with your testing framework or assertions package of choice.

GoConvey documents code beyond godoc.

Godoc is Go's convenient, built-in documentation system. It reads comments directly inline with your code to produce package, function, and variable documentation; examples; known bugs; and more.

Far from replacing godoc, GoConvey's DSL complements it. While godoc documents how to use a package, Convey tests document and enforce package functionality and behavior.

For example, suppose you wrote a bowling game. The godoc would probably explain how to use functions like NewGame, Score, and Roll, along with what they do. This is useful for users of the package, but it leaves a lot to be assumed by any developers who want to start hacking away on it. To fix this, you could make your godoc much more verbose and describe how the package is supposed to behave, but that's just noise for users who don't need to know about the internals, and developers still don't have any assertions that prove your program works as you say. You could write regular tests, but then you're hoping the comments stay true to the code.

This is where behavioral tests come in. The GoConvey tests for the bowling game make it clear exactly what the program should do normally and in edge cases; plus, it asserts correct behavior in context of the test case. The tests actually convey intent and become that missing documentation.

See, by using godoc along with descriptive tests, you've now sufficiently documented your code for both users and developers.

Deciding when and how to use GoConvey

Should everyone use GoConvey for everything?


GoConvey is all about workflow and intent. It's not for everyone in every situation. I don't even use it all the time.

It does some things very well, but other things not so much.

Does well:

  • Documents your code with descriptive, structured tests
  • Auto-runs tests
  • Assertions
  • Reports results in real-time in the browser or via desktop notifications
  • Integrates with go test and standard testing package
  • Stubs out test code for you

Doesn't do well:

  • Automated environments (regular go test works fine though)
  • Run outside of $GOPATH
  • Save battery life
  • Race detection
  • Truly randomized test execution order
  • Idiomatic Go (I'm referring to the DSL)

There’s no magic formula that I know of to determine whether you should use GoConvey one way or another. But here are a few ideas to try:

  • Execute the goconvey command and open the web UI on your existing Go tests. See how it feels.
  • If you are starting a new project, check out the examples folder and try a new way of writing tests using GoConvey's DSL (careful, you might like it). Make sure the GoConvey server is running so you get instant feedback after every save. (Or take a look at the tests for Go-AWS-Auth for a real use case.)
  • If you don't like the nested BDD structure, try using a single level of Convey instead. This way your tests can still be somewhat descriptive and you get the benefit of the many So assertions. (For example, you can still run table-driven tests using a for loop inside a single Convey level.)
  • Use Go's dot-import for the convey package so you can make it "one of your own," so to speak. Way more convenient.

One of our hopes for GoConvey is to make writing tests more of a pleasure than any old task on your to-do list. Watching the coverage bars go up, up, up and seeing that orange "FAIL" turn into a big green "PASS" is seriously motivating.

(okay, really; done with this meme now)

Tuesday, July 15, 2014

Papa Parse 3.0 is here, and it's boss

After months of re-writing Papa Parse from the ground up, version 3.0 is finally here. In case you don't know, Papa Parse is a CSV library for Javascript. With it, you can parse CSV files or strings into JSON, and convert JSON back to CSV.

(Nerd alert: Today was also the day of csv,conf in Berlin. Isn't that awesome? I had no idea.)

Papa Parse 3 is a major, breaking change. Here's a quick look at what's new.

First, a quick warning: 3.0 is a breaking change. It's not a drop-in replacement for 2.1... you will break your app! The API is slightly different, and the results are structured differently. Read on for more information.

New results structure

Previously, parse results were returned as "results", "errors", and "meta", and "results" contained the parsed data, or if using a header row, "fields" and "rows". This was confusing and led to awkward code that looked like results.results.rows[0][2] to access any data. The new structure of results is much more consistent and intuitive:

{
    data:   // array of parse results
    errors: // array of errors
    meta:   // object with extra info
}

The "data" property only ever contains the parsed data as an array, where each element in the array represents a row. In the case of a header row, the "fields" have been moved into "meta" where they belong.

Header and dynamic typing are off by default

In Papa Parse 2.x, the header and dynamic typing were enabled by default. Now, the default is false/off, which is more intuitive. If you want any fanciness, you have to turn it on explicitly.

Eliminated the jQuery dependency

Papa Parse is now fully "Papa Parse" and not the "jQuery Parse Plugin" - the files and repository have been renamed to accommodate. Where before you would invoke $.parse(), now you simply call Papa.parse(). Much more elegant.

Technically, Papa Parse is still a jQuery plugin. If jQuery is defined, it still has the familiar $('input[type=file]').parse(...) binding that you may have used to parse local files. This interface has been improved and parsing files has never been easier.

Since Papa has been completely de-coupled from jQuery, it's easier to use in Node and on pages that don't have or want jQuery brought in.

Unparse - convert JSON to CSV

Papa's specialty is parsing CSV into JSON, or Javascript objects. But now it can export CSV too. It's easy to use:

var csv = Papa.unparse([
    ["1-1", "1-2", "1-3"],
    ["2-1", "2-2", "2-3"]
]);
// 1-1,1-2,1-3
// 2-1,2-2,2-3

Here we passed in an array of arrays, but you could also pass in an array of objects. Even more settings are described in the documentation.

Run in web worker

Long-running scripts, like parsing large files or strings, can lock up the browser. No bueno. Papa Parse 3 can spawn a worker thread and delegate the heavy lifting away from your page. This means your page will stay responsive to mouse clicks, scrolling, etc., while heavy-duty parsing is taking place.

Web workers are actually kind of a pain in some sense, but Papa makes it easy. Just say worker: true:

Papa.parse(file, {
    worker: true,
    complete: function(results, file) { ... }
});

Download and parse files over the Internet

Papa has been able to parse local files using FileReader for a while. But now it's easy to download remote files and parse them, too. This isn't hard to do even without Papa, but the advantage here is that now you can stream the file. So if you have a large file, let's say 200 MB, sitting on another machine, you can give Papa the URL and it will download the file in chunks and feed you the results row-by-row, rather than loading the whole thing into memory. Big win!

Papa.parse("/files/big.csv", { download: true,
step: function(data) { ... },
complete: function(results) { ... } });

Those are the most notable new features and changes in version 3.0. There's a bunch of other stuff under the hood, too, that you'll benefit from.

Now maybe get your feet wet with the demo page or visit it on GitHub.

Sunday, July 13, 2014

An AWS signing library for Go

Go-AWS-Auth is a comprehensive, lightweight AWS signing library for Go. Simply give it your http.Request and it will sign the request with the proper authentication for the service you're accessing.

Other libraries like goamz can be useful and convenient, but they do come with a cost: less flexibility and a larger code base. Though Go-AWS-Auth only does signing, it is a reliable and transparent way to interact with AWS from Go. And it works directly with your http.Request objects for any AWS service.

Now making requests to AWS with Go is extremely easy:

url := "https://iam.amazonaws.com/?Action=ListRoles&Version=2010-05-08"
client := new(http.Client)

req, err := http.NewRequest("GET", url, nil)
if err != nil {
	// handle err
}

awsauth.Sign(req) // signs the request for the service you're accessing

resp, err := client.Do(req)

The library is thread-safe and supports multiple AWS authentication mechanisms.
Feel free to use it and contribute if you find ways to improve it!

Monday, May 12, 2014

Code for Javascript Web Workers in same file as main script

Javascript's Web Workers (actually HTML 5's) have a lot of potential and they're surprisingly safe to use. They're safe because we sacrifice one of the main benefits of threading: sharing memory. Values between the main thread and worker threads are copied, not pointed to, so you have that overhead to deal with. (You send messages between threads rather than sharing memory.)

Anyway, in JS libraries especially, it can be useful to spawn worker threads, but it's inconvenient to have those functions in entirely separate files sometimes. I'm not 100% sure of the implications yet, but I had some success with the following template:

// Template for worker script in same file as main script;
// adapted from http://stackoverflow.com/a/10136565/1048862
// to make it a little cleaner
(function(global) {
	"use strict";

	var isWorker = !global.document;

	if (isWorker)
		WorkerThread();
	else
		MainThread();

	// Entry point for when we're a worker thread
	function WorkerThread() {
		global.onmessage = messageReceived;

		// Worker thread execution goes here...
		// This is how to pass messages back to the main/parent script:
		// global.postMessage("From worker!");
		// (messages can also be objects)

		function messageReceived(e) {
			console.log("Worker received message from parent:", e.data);
		}
	}

	// Entry point for the main thread
	function MainThread() {
		var SCRIPT_PATH = getScriptPath();

		// Main script execution goes here...
		// This is how to start new workers and pass messages:
		// var w = newWorker();
		// w.postMessage("From parent!");

		function getScriptPath() {
			var id = "worker" + String(Math.random()).substr(2);
			document.write('<script id="' + id + '"></script>');
			return document.getElementById(id).previousSibling.src;
		}

		function newWorker() {
			var w = new global.Worker(SCRIPT_PATH);
			w.onmessage = messageReceived;
			return w;
		}

		function messageReceived(e) {
			console.log("Parent received message from worker: ", e.data);
		}
	}
})(this);

Anyway, I think I might use this again.

Monday, May 5, 2014

Scribbles from GopherCon 2014

Just a few notes I jotted down when I wasn't too mesmerized by the presentations... they probably won't make sense unless you were there or can see the talks. I'm mostly posting these for my own benefit. Anyway...

High Performance Go

  • Use go tool pprof (linux)
  • Avoid short-lived objects on the heap
  • Use the stack or make long-lived objects
  • Benchmark standard library builtins (strconv)
  • Benchmark builtins (defer, hashmap)
  • Don't use channels in performance-critical paths

CSP (Communicating sequential processes)

Unbuffered channels are nice because there's no ambiguity: you know that both sides are communicating right now, and are synchronizing.

Nil channels block; they can be used in selects to know when multiple channels are closed
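A quick sketch of that nil-channel trick: fan two channels into one, and nil out each source as it closes so its select case stops firing:

```go
package main

import "fmt"

// merge fans in two channels; when a source closes, we set it to nil so
// that its select case blocks forever and is never chosen again.
func merge(a, b <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for a != nil || b != nil {
			select {
			case v, ok := <-a:
				if !ok {
					a = nil // a is closed; stop selecting on it
					continue
				}
				out <- v
			case v, ok := <-b:
				if !ok {
					b = nil // b is closed; stop selecting on it
					continue
				}
				out <- v
			}
		}
	}()
	return out
}

func main() {
	a := make(chan int, 2)
	b := make(chan int, 1)
	a <- 1
	a <- 2
	close(a)
	b <- 3
	close(b)

	sum := 0
	for v := range merge(a, b) {
		sum += v
	}
	fmt.Println(sum) // 6
}
```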

Bluetooth Low Energy (BLE) and Embedded Go

If you're worried about GC pauses, think about the number of allocs as much as the size of your allocs.


You don't need to have a big problem to solve to use Go.

Heka (on streaming data)

Basic pattern: read in data from a stream, split at record boundaries, transcode the records to a common format, route to the appropriate service in its own format.

Bind a single struct to a goroutine (or a single goroutine to a struct).

High-performance database in Go

(From basic OS classes) How to think about performance: optimize the following, ordered from least optimization effort to most: memory access, mutual exclusion, memory allocation, disk I/O, network I/O.

MongoDB and Go

Rethink using Rethink.

But maybe rethink using Mongo, too. Choose the right tool for the job.


The DB type represents your database; it is not a connection.

Monday, March 24, 2014

An in-depth review of Google Fiber's free service, with pictures

We had Google Fiber installed a couple months ago (Feb 2014). It was activated on the 25th. Previously, we relied on Veracity which, unfortunately, shared the connection with the whole building of condos.

Problems without Google Fiber

There are a few network switches in our basement parking area which feed each apartment with ethernet cables. Those switches fail on occasion.

What's worse, and kind of hilarious, is that if one resident plugged their router in wrong (by inserting the WAN cable into a LAN port on the router), it brought down the Internet for the whole building, because those switches thought the router was the gateway to the WAN.

Fortunately, tracking down the offending router isn't too hard, because residents leave the default username and password on their routers, so I can log in and see which devices are connected; their computers often have their owner's name.

The nice thing is that a dedicated fiber line to each apartment prevents this kind of chaos.

The installation

After we signed up, the installers were very autonomous. I had to do nothing, except be around when they finished. First, these guys show up at your house:

And they leave you a nice little note:

Then a few days later, a pull string appears. It's remarkably tiny (they later run the fiber cable and put a protective sleeve over it):

Then finally the jack is mounted (read below for more about this):

The router you get is real nice, but has a small fan (read below for more about this, too):

The back is elegant and simple:

The speed tests

Before Google Fiber

With Google Fiber

Since we're poor, our apartment is on the free plan. It's advertised as 5 Mbps down, reliably and consistently, but the upload speed isn't ever really mentioned. Here's why...

Fortunately, our apartment still has both lines active because the HOA hasn't turned off Veracity service. I can use either.

However: when others in my apartment are downloading something, I get lower speeds... I guess that's how it is, but it would be nice if it was 5 Mbps per client, not total.

They also have a special page, also powered by Ookla, for doing a speed test. Their page considers whether you're connected wirelessly or hard-wired:

But again, when I'm the only one using it, the 5 Mbps down is very fluid and reliable. The upload speed is disappointing, however. 1 Mbps seems a bit too throttled, even for free.

The wall jack

They install a wall jack which, kind of oddly, has to be powered by a power outlet. The nice thing is you get a power strip/surge protector for free, but it's kind of strange that a wall jack has to be plugged in for power. I guess they do this to speed up installations in bulk, which is why they can offer this for free (or cheap if your HOA doesn't pay it for you).

The light doubles as kind of a night-light, since it's so bright. Blue indicates normal operation, but it can also be red. I came home one night after everyone was gone for the day and it was flashing red. I pulled out my phone and loaded a page. After a few seconds, the light turned blue, but I had to reload the page on my phone before it came in successfully. I wonder if the unit hibernated with little or no activity.

They also provide a short CAT6 cable to plug into their router.

The router

Google gives you a sleek black router that supports 2.4 and 5 GHz b/g/n WiFi. It has a small fan, so it makes a quiet hum which you can hear if you're close. And of course, this hardware all supports IPv6.

So how do you manage this thing? Let's take a look.

Managing your Fiber network

By default, Google provides access to very basic management of your network box from anywhere, just by logging into "My Fiber" and clicking on "My Network". There you have the simple options of changing your network name and wireless password, and I think you can turn wireless off, too. That's about it. But it's cool that you can do that from anywhere, by default, without needing to know how to administer a router.

But, of course, the first thing I did was turn on Advanced Mode:

With this mode, you log in using a local IP address and it's very much like a typical router administration interface. Notice at the bottom of that screen they tell you the password so you can log in as admin.

A more advanced look at the admin interface

Their advanced interface isn't quite as pretty, but it's one of the most functional ones I've seen. 

The login username is always admin, paired with a good password:

The landing page tells you some useful high-level info about the state of the network and its clients:
I have blurred out a few things in these screenshots in case they are potentially personally-identifiable (I err on the side of caution). And since my roommates share this network, I've blurred most of their info.

You can examine and manage the wireless radio:

You can see and manage how IP addresses are being assigned to devices. Notice there's also a Dynamic DNS option (not previewed here), which, after I fiddled with it, is pretty modern and useful:

 Firewall settings include port forwarding (not pictured)...

 ... and connection details... holy moly, it's an IPv6 heaven:
There's also a summary of the system:
And of the WAN specifically (you'll see that the Fiber link is down; this is because we aren't subscribed to the gigabit plan):

Interesting and proper, I suppose, that the gigabit adapter is a different piece of hardware. It gets its own MAC address.

Conclusion: Yay or Nay?

Politics and business preferences aside, Google Fiber has been a good experience so far. Their support staff is great, their service centers are helpful and educational, their Fiberspace (demo/retail store) is fun, it comes with IPv6 support, and what can I say -- I have free Internet.

Overall perceived download times are about as fast as they were before, maybe slightly faster. But the connection is definitely more reliable. Yes, it is slower when my roommates use it: multiple users have a marked, adverse effect on speed. But even when the speed is slow, I can usually stream videos without delay, and in HD. (There are exceptions to this regularly, but not nearly as severe or common as before.)

So yes, overall I think it is an improvement. Surely the gigabit plan is great, too... though, a note: you have to plug into the router to get that whole gigabit speed, and you have to be using a CAT6 cable, and your computer has to be able to support the throughput. The wireless-N band, if you choose to use that, will still be blazing fast (about 250-350 Mbps), but you won't get true gigabit if you stay wireless.

It was a big deal when Google Fiber came to town. Provo is just eating this up. There are flags and banners on the street lights downtown, and the mayor, city council, and other organizations have been superb in working with them. (Except BYU, but they don't need Google Fiber anyway. On campus I get about a 40 Mbps connection, even with hundreds of students around.)