Thursday, July 3, 2008

Perl: Handy, but Ugly

In what will probably be a many-part series, here's an oddity of Perl that had me tearing out my hair for a couple of hours...

If you know Perl well, feel free to skip this paragraph. Perl has a handy but ugly notion of context. Specifically, code executes in either a scalar or a list context: if a single value is expected, the code executes in scalar context; if a list of values is expected, it executes in list context (that's vague, but good enough for now). Then, code behaves differently depending on the context.

One example of context is getting the length of a list. Given a list @foo = ('a', 'b', 'c'), then @foo in scalar context is the length of @foo. Thus, $x = @foo sets the single value $x to 3 (the code executes in scalar context because $x is a single value, so Perl expects a single value assigned to it).

Now for a pop quiz. If @foo = ('a', 'b', 'c'); $x = @foo sets $x to 3, what does $x = ('a', 'b', 'c') do? Turns out it sets $x to c. Fascinating, isn't it?

The reason is that the comma does different things in list and scalar contexts. In a list context, comma is the list building operator. Thus, ('a', 'b', 'c') in list context (such as when assigned to the list variable @foo) returns a list with three items. However, in scalar context, comma is like C's comma: it executes both its left and right operands, then returns the result of the right. For instance, 'a', 'b' returns b, and 'a', 'b', 'c' returns c. Thus, when we assign ('a', 'b', 'c') to a single value, the code executes in scalar context, returning c.

Of course, I wasn't lucky enough to have this bite me in such a simple form. Instead, consider this (still heavily simplified) example:

sub foo {
    $a = "hello";
    $b = "world";
    return ($a, $b);
}

print join(" ", foo()) . "\n";
print scalar(foo()) . "\n";

I naively thought this would print hello world then 2. Instead, we get hello world then world. Today's lesson, then: when returning lists from functions, assign them to a list variable first.

Thursday, June 26, 2008

Project Note Taking System

A while ago, I went looking for a good note taking system. Notes, as in on paper. I work on a lot of projects, and since I grok things better when I write them down, I needed a way to organize ideas, meeting minutes, tasks, and progress.

I found several hacks to turn my preferred notebook, a Moleskine, into a full-fledged PDA replacement using GTD. However, I didn't want a PDA replacement. I wanted a simple way to organize project ideas.

I also found a lot of good note-taking systems. Of these, the Cornell system was closest to what I wanted. I liked the idea of taking notes, then adding higher-level comments off to one side. Unfortunately, page division doesn't work well in a small notebook, and the system isn't very project-oriented.

Thus, after some trial and error, I've mostly settled on something that works well for me. I begin with a large, graph paper Moleskine, though any notebook should work. Next, I take notes on the right-hand page, then write higher-level comments on the left page. That's the gist. The fun part is the details.

On the right-hand page, I always first write the date in the upper-right-hand corner. This makes finding old notes a lot easier. After that, I take notes however I like — outlines, drawings, mindmaps, whatever.

Then, both while writing notes and when reviewing them, I write higher-level comments on the left page. I find it useful to vertically align them with the part of the notes they comment on. Each comment is labeled, in the form Label: comment, so that I can immediately tell what kind of comment it is. I use five labels:

Topic
Every left-hand page has a single Topic comment first thing on the page. It's short phrases or keywords to remind me what the notes are about. By using only one per page and putting it at the top, it's easy to flip through the notebook and find notes about particular topics.
Thought
These are interesting thoughts about the notes, such as summarizations, ideas, etc.
Tip
Often my notes include good lessons, so Tips are things I want to do differently in the future.
Task
These are things that I need to do based on the notes. As I do them (or move them to a better task management system), I check them off.
Tack/Tank
This pair of labels keeps track of tacks we've taken in the project, and why we've decided to tank them. I use them because I found that projects often cycle back to old ideas without remembering the very good reasons they were killed in the first place. To illustrate their use, suppose we have a project meeting on Monday and decide to use MySQL. My notes on the right-hand page contain our reasoning, and I add "Tack: use MySQL" to the left-hand page, leaving some space underneath. On Tuesday, we change our minds, and decide to use SQLite instead. So now I add "Tack: use SQLite" to Tuesday's left-hand page. Then, I go back to Monday's page, and under the "Tack: use MySQL" comment, I add a Tank comment explaining why we're no longer using MySQL.

That's it. Fairly easy to use and well organized, and relatively easy to find information later. Of course, it's not ideal. What I really want is a lightweight tablet PC, about the size of my Moleskine but all screen, with a swivel keyboard, and nice note software with tags, tree-structure organization, and handwriting search. But before saying such things exist, I also want it to be affordable. Good luck to me. Until then, I'll keep buying Moleskines.

Thursday, May 29, 2008

Industry vs. Research

As a graduate student looking for jobs, a common question I heard was "Industry or research?" Industry jobs include developers, technical managers, and even applied researchers to a large degree. Fundamental research jobs are professors and research scientists at industrial labs, such as Microsoft Research and Yahoo! Research. The definitions are somewhat fuzzy (applied research is industry? industrial labs are research?), but a generally distinguishing characteristic is whether publishing papers is a primary aspect of the job.

It was this characteristic that made me realize the fundamental difference between industry and research. It's one I wish I had known when I started graduate school. In short,

Industry is primarily about selling products, while research is primarily about selling stories

This is why publishing papers is telling: papers are a medium for selling stories. Of course, a good story helps sell a product, and a working product helps sell a story. So there is definite overlap. However, it's telling how well the pros and cons of industry and research derive from this basic difference.

To illustrate, consider some classic pros and cons. In industry, since you sell products, your work has direct impact on people that use the product. Since people will typically pay for this impact, the product itself is the source of funding. And if your product is sufficiently impactful, it is the source of a lot of funding, and you get rich. However, this means it's critical to quickly and consistently create marketable products. The result is a dampening effect on the problems targeted by industry: they are dictated by the market, and typically have shorter-term visions with fewer (or at least more calculated) risks.

In contrast, research has significantly more freedom in the problems it tackles. They are often longer-term, riskier visions. Research can do this because it only has to sell stories describing core ideas, not fully working products. Thus, it can focus on interesting technical problems. However, "selling" a story does not usually mean for money, but rather convincing people that it describes a good idea (e.g., getting a paper accepted to a conference). Since neither the story nor the idea generates money directly, researchers must seek out external funding such as grants, or, in industrial labs, income from products (which, to be fair, often contain the final fruits of research).

Given such pros and cons, the distinction of product vs. story seems obvious in hindsight. However, what made me first realize it was a more subtle situation. My advisor asked me to devise a data model for the system we're building. I came back with two options: a very common model, and a novel model that was simpler and more expressive. I favored the novel model, but my advisor said we should use the common one. His reason was that the data model was not our primary contribution, and papers with too many innovations can confuse readers. And he was right. Even though the novel model would make for a better system, the common model makes for a better story — and I'm currently in the business of selling stories. At some later date, after we sell our current story, we may sell another story that focuses on a new data model.

To conclude, I want to say that this isn't meant to promote either industry or research. In my particular case, I've found that I lean more towards selling products than stories. However, I've spoken with both developers and researchers, and both agree with the product vs. story differentiation, and each prefers their side. Of course, I'd love to hear from anyone else on the topic. I just think that understanding this difference is vital to making an informed decision about graduate school, and life afterwards.

Sunday, May 25, 2008

Job Decision

It's finally official. After nearly two months of applications, interviews, travel, negotiation, introspection, and extremely hard thinking, I've made a job decision. Come January next year, I'll start as a Program Manager in Microsoft's SQL Server team.

This was a very difficult decision, as I had to choose between five compelling offers. In the end, there were two primary considerations: location, and how I want to contribute to my field.

My offers spanned two locations: Microsoft in Seattle, and the others in Silicon Valley. I characterized my options as better quality of life in Seattle vs. proximity to networking and friends in Silicon Valley. Seattle's quality of life is better due to lower cost of living, much cheaper housing (I can actually afford a nice house my first year), and significantly nearer mountains. It also feels more laid-back. On the other hand, Silicon Valley hosts constant interaction between innumerable tech companies, providing excellent networking opportunities and mobility. Also, several of my friends live there.

For me, Seattle and Silicon Valley were effectively tied. However, this was a two-person decision, so Sarah joined me in visiting both places. She met and loved my Silicon Valley friends, and received a great tour of Seattle courtesy of Microsoft. Sarah sees locations differently than I do. I pick a job, and that decides the location; Sarah picks a location, then finds a job. Location is part of how she defines where she wants her life headed. As it happens, before we were engaged, she was already looking to move to the Pacific Northwest. Thus, though she liked California, and especially my friends, Washington is closer to where she wants to be. This was one consideration.

The other strong consideration became clear after many conversations with mentors. The key question is how I want to contribute to my field. One path is as a technical luminary, with primarily technical contributions. This path includes god-like developers, researchers, and other deeply technical people. My offers at IBM, Oracle, and Yahoo! followed this path. Another path is as a technical manager, with primarily leadership and strategic contributions. This path includes general managers, CEOs, and other big-picture people. My offers at Google and Microsoft followed this path. I've spent most of my life as a deep techie. However, due to some eye-opening experiences and a lot of introspection, I've decided that, at least currently, my calling is management and leadership.

Neither of these considerations alone decided it for me. But due to both together, plus several secondary ones, I've accepted the Microsoft offer. A couple of things in particular really impressed me about the position. First, I got to meet several team members, including my future boss, and they're all amazing. Second, Microsoft is very serious about investing in people and building careers, so the opportunities for mentorship and advancement are fantastic. I'm extremely excited, and really looking forward to starting. All that's left is to finish my doctorate!

Finally, to wrap up, I want to very sincerely thank everyone who helped me throughout this process. All of my mentors for their advice; all of my friends for their time, love, connections, and support; and all of my family for putting up with weeks of waffling (individuals may fall into more than one category). I know not everyone will be happy with my decision, but I hope you will all be happy for me. Of course, feel free to send along any particularly strong variations on "You fool!". I promise no hard feelings.

Sunday, March 30, 2008

Interviews and Conference and Travel, Oh My!

The last few weeks have been insane, and the next few weeks promise to be just as crazy. I've started my job search, which includes a lot of interviews. I've already interviewed on-site at Microsoft for two positions, and received offers from both. On Monday, I have two phone interviews with Google, again for two positions. Then, in the upcoming weeks, I have on-site interviews at Rapleaf, Oracle, IBM Almaden, and Yahoo! Research.

Just interviewing involves a lot of travelling. However, as added fun, I'll also be in Cancun for ICDE 2008, where I'll be presenting one of my papers.

So between interviews, the conference, and associated travelling, there's basically no time for posting. However, once all is over, I'll share interview resources, conference tidbits, and maybe details about the work I'm presenting. At any rate, just didn't want everyone to think I'd given up on blogging.

Wednesday, March 12, 2008

Tasty Favorites: Pedro Pasta

At the urging of a couple persuasive friends, I'd like to start sharing the results of one of my pastimes: cooking. Though I enjoy making somewhat involved meals, I also like coming up with tasty foods that encourage me to eat at home. This means quick, cheap, and with minimal clean-up.

One favorite is what Sarah calls "Pedro Pasta". It's simple, delicious, and (not counting the time to boil water) takes about five minutes.

Photos courtesy of Sarah

First, an aside: I'm a huge fan of Michael Chu's Cooking for Engineers, especially his tabular recipe notation. It's an extremely elegant and concise way of displaying recipes, so I'm borrowing it for these posts.

For this recipe, I'm not too picky about quantities. The recipe is simple enough that it's easy to pick appropriate quantities based on how many people you're serving, and taste.

Pedro Pasta
  Water: Boil
  Water, Pasta: Cook, then Drain
  Cherry Tomatoes: Cut half in half
  Cherry Tomatoes, Olive Oil: Combine (squeeze cut)
  Garlic: Mince
  Cherry Tomatoes, Olive Oil, Garlic, Italian Seasoning, Salt: Stir
  All ingredients: Toss, then Let sit 5 min.

Steps:

  1. Boil water for pasta. I generally add salt to the water to flavor the pasta and help cook it (salt water boils hotter).
  2. While the water boils, prepare the other ingredients: cut about half of the cherry tomatoes in half, and mince the garlic. If you're using fresh herbs instead of the dried "Italian Seasoning" mix (basil, oregano, rosemary, etc.), chop up the herbs as well.
  3. Add pasta to the boiling water. I like spaghetti or angel hair, but any pasta works fine.
  4. While the pasta cooks, prepare the sauce: put olive oil in a large bowl, and squeeze into it the cherry tomatoes you cut in half. Then stir in the tomatoes (both squeezed and whole), garlic, herbs, and salt to taste. You'll want enough olive oil/tomato mixture to coat the pasta, but this is also largely a matter of taste.
  5. Once the pasta has finished cooking, put it into the bowl with the sauce, and toss it.
  6. Finally, let sit for five minutes or so. The heat from the pasta will cook the garlic, and generally help spread the flavors around.

And that's it. Serve with some grated Parmesan cheese, and you have a quick and tasty meal. Of course, some may complain that it's lacking in protein. To address this, I make an accompaniment:

Pedro Pasta Accompaniment
  Garlic Olive Oil: Heat (med-low)
  Extra Firm Tofu: Cube
  Garlic: Mince
  Garlic Olive Oil, Extra Firm Tofu, Garlic: Lightly sauté
  Salt, Italian Seasoning, Lemon Pepper: Add, then Sauté
  Red Wine: Add, then Cook (med)

Steps:

  1. Heat some garlic olive oil over medium-low heat (regular olive oil works fine if you don't have the garlicky variety).
  2. Add cubed extra firm tofu and minced garlic, then sauté lightly, just until the tofu starts to get a bit golden.
  3. Add salt, Italian seasoning, and lemon pepper to taste (you can buy lemon pepper at just about any grocery store). Continue sautéing until the tofu is nice and golden.
  4. Add a splash of red wine, then turn the heat up to medium and cook until the tofu browns (or however you like it).

Once finished, just throw it in with the Pedro Pasta.

If you're not vegetarian (Sarah is, but I'm not), you can use chicken instead of tofu. The steps are the same, except replace tofu with chicken, and red wine with white. Also, make sure the chicken cooks thoroughly, which may require a slightly higher temperature in the early steps.

Thursday, February 28, 2008

Google Sites, MS SharePoint...Creating Communities

Google recently announced a new product, called Google Sites. The basic idea is that it lets you gather and share data (e.g., Google Apps documents, files, free-form wiki-style pages) pertaining to a particular purpose (e.g., business, team, project). Inevitably, Sites is being compared to Microsoft SharePoint, which addresses a similar need.

What's fascinating to me about Sites and SharePoint is how they relate to my research on community information management. Briefly, a community has a shared topic or purpose on which they have data, such as web sites, mailing lists, and documents. This data is often unstructured. I research systems that process this unstructured community data to extract structured information about entities and relationships, then provide structured services beyond keyword search (e.g., querying, browsing, monitoring). I've built a very alpha prototype system, DBLife, for the database research community, and also published some papers on the topic.

How Sites and SharePoint relate is very exciting: they build communities. Currently, my research helps builders select web pages and other data sources. However, by integrating with a Microsoft SharePoint installation or a Google Site, the data is already there. And the benefits to users are palpable. Instead of only keyword search, users would have powerful structured access methods, making the application much more useful. Truly, it'd be very exciting to see something like that happen.

Wednesday, February 27, 2008

Bash Command-line Programming: Flow Control

Time for another post on handy techniques for command-line bash programming. This post covers some useful command-line techniques for flow control.

Even when writing quick programs on the command-line, I often need to branch or loop. Especially loop, as I often need to do something over every file in a directory, or every line in a file. Below are some techniques I commonly use. For more neat bash-isms, check out the Bash FAQ.

  • cmd && trueCmd || falseCmd: if cmd executes successfully, run trueCmd, else run falseCmd. This is a pithy near-equivalent of
    if cmd; then trueCmd; else falseCmd; fi
    with one difference: if cmd succeeds but trueCmd itself fails, falseCmd runs as well.
  • while cmd; do stuff; done: execute stuff while cmd executes successfully. Use
    while true; do stuff; done
    for an infinite loop.
  • for W in words; do stuff; done: sets the variable $W to each word in words, then executes stuff. For instance, to run foo on every text file in a directory tree, use
    for W in $(find . -name '*.txt'); do foo "$W"; done
    Note that words are split automatically based on whitespace. This means that filenames with spaces will be split into multiple words (I know find has the -exec option, but it can be cumbersome, and this is just an example). To avoid splitting on whitespace, see the next tip.
    Edit: Originally, my example used /bin/ls *.txt rather than find. However, as HorsePunchKid points out,
    for W in *.txt; do foo "$W"; done
    works on filenames with whitespace (and is also cleaner). This is an excellent point, but the only expansion done after word splitting is pathname expansion, so it applies only to file globs. If you're processing the output of a command, or the contents of an environment variable, then you'll still have a word splitting problem.
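  • while read L; do stuff; done: sets the variable $L to each line in stdin, then executes stuff. Use this to handle input with spaces (to get each line exactly as written, IFS= read -r L also keeps backslashes and leading or trailing whitespace intact). For example, to run foo on every text file in a directory tree, including those with spaces, use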
    find . -name '*.txt' | while read L; do foo "$L"; done
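
To see the word-splitting difference concretely, here is a minimal sketch. The helper names (list_with_for, list_with_glob, list_with_read), the scratch directory, and the filenames are all made up for illustration, and it assumes the path returned by mktemp contains no whitespace.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: each prints one "item:" line per loop iteration.
list_with_for()  { for W in $(find "$1" -name '*.txt'); do echo "item: $W"; done; }
list_with_glob() { for W in "$1"/*.txt; do echo "item: $W"; done; }
list_with_read() { find "$1" -name '*.txt' | while read -r L; do echo "item: $L"; done; }

# Scratch directory holding one filename that contains a space.
dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir/two words.txt"

# grep -c '' counts the lines, i.e. the loop iterations.
list_with_for  "$dir" | grep -c ''   # 3: "two words.txt" was split into two words
list_with_glob "$dir" | grep -c ''   # 2: globs expand after word splitting
list_with_read "$dir" | grep -c ''   # 2: one iteration per line

rm -r "$dir"
```

Only the first loop miscounts, which is why I reach for a glob or while read whenever filenames may contain spaces.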

The last tip has a caveat: a piped command executes in a subshell with its own scope. Thus, if I use cmd | while read L; do stuff; done, variables set in stuff are not available outside of the loop. For example, if I want to run foo on every text file, then print how many times foo succeeded, I could try this:

I=0
find . -name '*.txt' | while read L; do foo "$L" && let I++; done
echo $I

However, this prints 0. The reason is that $I outside the pipe is a different variable than $I inside the pipe. To fix this, avoid a pipe using a trick from my earlier post:

I=0
while read L; do foo "$L" && let I++; done < <(find . -name '*.txt')
echo $I
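
For a version that runs without any text files lying around, here is a small sketch of the same comparison. The foo function is a hypothetical stand-in (it succeeds on any non-empty line), and count_piped and count_redirected are names I made up for the two variants.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for foo: succeeds only on non-empty input.
foo() { [ -n "$1" ]; }

# Pipe version: the while loop runs in a subshell, so its counter is lost.
count_piped() {
    local n=0
    printf 'a\nb\nc\n' | while read -r L; do foo "$L" && n=$((n + 1)); done
    echo "$n"
}

# Redirect version: the loop runs in the current shell, so the counter survives.
count_redirected() {
    local n=0
    while read -r L; do foo "$L" && n=$((n + 1)); done < <(printf 'a\nb\nc\n')
    echo "$n"
}

count_piped        # prints 0
count_redirected   # prints 3
```

This assumes bash's default settings. Newer versions of bash (4.2 and later) also offer shopt -s lastpipe, which runs the last command of a pipeline in the current shell when job control is off, but the redirect works everywhere.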

Friday, February 22, 2008

Managing papers with GMail

As a graduate student, I read a lot of papers. Then, I often want to write notes about these papers, categorize them, find them quickly, etc. However, despite being a common problem for graduate students (or anyone else keeping track of documents), there are few free solutions that are any good. Thus, I rolled my own using GMail.

Available Solutions are Limited

Unfortunately, there aren't many free solutions for managing papers. In fact, the only decent one I've found is Richard Cameron's CiteULike. CiteULike provides all the necessities: online storage, tagging, metadata search, and note taking. It also has two other draws: one-click paper bookmarking from supported sites, and social features for sharing and collaboration.

However, CiteULike has a deal-breaker for me: its search capabilities are very limited. It provides keyword search only over paper titles, author last names, venues, and a part of the abstract (to the best of my knowledge, since it doesn't list what it searches). It does not search the paper's full text, or even your notes. This can make finding papers based on vaguely remembered information very difficult.

Using GMail to Manage Papers

To address CiteULike's limited search, I decided to manage papers with GMail. The basic idea is that I keep each paper and its notes in an email thread. Then, further notes are replies to the thread. This supports writing richly formatted notes, as well as GMail's search over each paper's full text and any notes I've written.

Below, I describe the steps to set up the solution, add a new paper, take notes on a paper, and find a paper I've read. Finally, I compare the advantages of using this solution to using CiteULike.

Setup

Setup is trivial, consisting of creating a new GMail account for storing papers. I'll refer to this account as papers@gmail.com.

Adding a New Paper

After creating the account, I add new papers by sending email to papers@gmail.com. To ease finding the paper later, I use the following steps, which take only a minute or so:

  1. Start a new email to papers@gmail.com. Then, fill in the paper information. The key is to put the paper title as the subject, and include the author name, venue, and any other metadata you may want to search for later. The image to the right is an example.
  2. Attach the PDF or PS file of the paper to the email.
  3. Send the email. Since it's to me, it will appear in my inbox.
  4. Respond to the email with the full text of the paper (if necessary, delete any other text first). To get the text from the PS or PDF file, I use the pstotext or ps2ascii Linux programs. The xclip program is handy for putting the text in the clipboard, from which I paste it into the response.
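
The text extraction in step 4 can be sketched as a one-line helper. This assumes ps2ascii (from Ghostscript) and xclip are installed; paper_to_clipboard is a hypothetical name and paper.pdf is a placeholder, neither comes from my actual setup.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract a paper's text and put it on the clipboard.
# Assumes ps2ascii (Ghostscript) and xclip are available on the PATH.
paper_to_clipboard() {
    # With only an input file, ps2ascii writes the extracted text to stdout;
    # xclip reads stdin and stores it in the X clipboard selection.
    ps2ascii "$1" | xclip -selection clipboard
}

# Usage (paper.pdf is a placeholder filename):
#   paper_to_clipboard paper.pdf
```

After running it, the text is ready to paste into the reply in GMail.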

These steps accomplish three things. First, they store the PDF or PS of the paper in GMail. Second, they make the paper's full text searchable. Finally, they put the paper's author, venue, year, and other important data in an email with an attachment. This last is important because it lets me search over just this information by restricting the search to emails with attachments (see the section on finding papers below).

Taking Notes

Papers I have added but not finished reading are in my inbox. As I read a paper, I add notes by replying to the conversation from within GMail (first deleting the quoted text). Thus, my notes can use GMail's rich text features, such as lists and bolding.

Once I finish reading a paper, I tag the conversation with appropriate tags. Finally, I archive the conversation.

Finding a Paper

To find a paper, I use GMail's search functionality. This searches the full paper text and all notes, and supports searching on tags and dates. Furthermore, due to how I add papers, I can find paper titles by restricting the search to email subjects (the subject: operator), or restrict it to emails with attachments (has:attachment) to find author names, venues, and other information in the first email of each paper.

Comparing Solutions

Given the above procedure, GMail can compete with CiteULike as a system for managing papers. However, though better in some ways, it is also limited in others.

Specifically, my solution has these limitations:

  • No one-click adding of papers from supported sites.
  • No automatic BibTeX generation. However, though not quite as good, BibTeX entries from CiteSeer, Google Scholar, or other sites can still be saved as notes.
  • Can't easily edit existing notes. Instead, must copy and paste the old note into a new note, then delete the original.
  • No social or community features, such as sharing papers.

However, my solution has these advantages:

  • Can search the full text of papers and notes.
  • Supports more sophisticated searches, including dates.
  • Richly formatted notes, and a nice interface for writing and reading them.
  • Can easily print or forward one or all notes about a paper (tip: before printing/forwarding all notes, delete the note containing the full paper text, then restore it afterwards).

Depending on which advantages are more important to you, it may be worth giving this a try.

Monday, February 11, 2008

Bash Command-line Programming: Redirection

The bash shell is also a pretty handy programming language. One way to use this is writing scripts. However, another use is writing ad-hoc, one-time-use programs, for very specific tasks, right on the command line. I do this a lot, and find myself using the same techniques over and over.

In this post, I'll share some useful command-line techniques for redirection.

There are many ways other than pipes for redirecting stdin and stdout:

  • cmd &>file: send both stdout and stderr of cmd to file. Equivalent to cmd >file 2>&1.
  • cmd <file: feeds the contents of file to the stdin of cmd. Similar to cat file | cmd, except that while the commands in a pipe execute in subshells with their own scope, this keeps everything in the same scope.
  • cmd <<<word: expands word and feeds it to the stdin of cmd. word can be anything you'd type as a program argument. For example, cmd <<<$VAR feeds the value of $VAR to cmd.
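
Here is a minimal sketch that exercises all three forms in a scratch directory. The noisy function and the file name both.log are made up for illustration.

```bash
#!/usr/bin/env bash
tmp=$(mktemp -d)

# Hypothetical command that writes one line to stdout and one to stderr.
noisy() { echo "to stdout"; echo "to stderr" >&2; }

# &> captures both streams in a single file.
noisy &>"$tmp/both.log"

# <file feeds the file to stdin without a pipe, so the loop stays in this
# shell and the counter is still set afterwards.
lines=0
while read -r line; do lines=$((lines + 1)); done <"$tmp/both.log"
echo "$lines"    # prints 2

# <<< feeds an expanded word to stdin.
VAR="hello world"
upper=$(tr 'a-z' 'A-Z' <<<"$VAR")
echo "$upper"    # prints HELLO WORLD

rm -r "$tmp"
```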

Also, sometimes programs need arguments on the command line, rather than through stdin:

  • cmd $(<file): expands the contents of file as arguments to cmd. For example, if the file toRemove contains a list of files, rm $(<toRemove) removes those files.
  • cmd1 <(cmd2): creates a temporary file containing the output of cmd2, then puts the name of that file as an argument to cmd1. This is handy when cmd1 expects filename arguments. For example, to see the difference between the contents of directories dir1 and dir2, use diff <(ls dir1) <(ls dir2). This is conceptually equivalent to
    ls dir1 >/tmp/contentsDir1
    ls dir2 >/tmp/contentsDir2
    diff /tmp/contentsDir1 /tmp/contentsDir2
    rm /tmp/contentsDir1 /tmp/contentsDir2
    
    (only conceptually, though, since it actually uses fifos). For another handy command for this, check out comm.
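
Both tricks can be tried safely in a scratch directory. The sketch below is only an illustration: the directory layout is made up, and it assumes the path from mktemp contains no whitespace, since $(<file) splits on it.

```bash
#!/usr/bin/env bash
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/a" "$tmp/dir1/b" "$tmp/dir2/b" "$tmp/dir2/c"

# <(cmd): diff receives two file names and reads each ls output through them.
# diff exits non-zero when the inputs differ, hence the "|| true".
listing_diff=$(diff <(ls "$tmp/dir1") <(ls "$tmp/dir2")) || true
echo "$listing_diff"    # reports a only in dir1 and c only in dir2

# $(<file): the contents of toRemove become the arguments to rm.
printf '%s\n' "$tmp/dir1/a" "$tmp/dir2/c" >"$tmp/toRemove"
rm $(<"$tmp/toRemove")
remaining=$(find "$tmp/dir1" "$tmp/dir2" -type f | grep -c '')
echo "$remaining"       # prints 2: only the two b files are left

rm -r "$tmp"
```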

Finally, you sometimes want to redirect to and from multiple programs at once:

  • { cmd1; cmd2; cmd3; } | cmd: pipes output of cmd1, cmd2, and cmd3 to cmd. Note that bash requires the space after { and the semicolon before }.
  • cmd | tee >(cmd1) >(cmd2) >(cmd3) >/dev/null: pipes output of cmd to cmd1, cmd2, and cmd3 in parallel. This trick is a tweak on an existing one. In the same way <(cmd) is replaced with a file containing the stdout of cmd, >(cmd) is replaced with a file that becomes the stdin of cmd. Since tee writes its stdin to each given file, you can combine it with >(cmd) to send the output of one command to the stdin of many. The final >/dev/null discards the stdout of tee, which we no longer need. Doesn't come up too often, but it's certainly neat.
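
As a rough sketch of both forms, here is a tiny example using only standard tools. The variable names grouped and fanned are made up, and the final sort is my addition: the consumers run in parallel, so without it the order of their output is not guaranteed.

```bash
#!/usr/bin/env bash
# { ...; } groups several commands so a single pipe receives all their output.
grouped=$({ echo one; echo two; echo three; } | grep -c '')
echo "$grouped"    # prints 3

# tee plus >(cmd) sends one stream to two consumers at once: one counts the
# lines, the other upper-cases them. Both write to the pipeline's stdout,
# which is sorted here only to make the result repeatable.
fanned=$(printf 'x\ny\n' | tee >(grep -c '') >(tr 'a-z' 'A-Z') >/dev/null | LC_ALL=C sort)
echo "$fanned"     # prints 2, X, and Y on separate lines
```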
