JavaScript Snippet – Simple CSV Parser

As I mentioned in yesterday’s post, I was recently working on a quick way to parse CSVs into an array of arrays and alternatively into an array of dictionaries keyed on the values in the first row. I ended up landing on the following definition:

As you can see, the function is annotated for anyone who may be interested in using it. Let’s look at an example of how it could be used. Here is an example string that can be parsed:

ID,First Name,Last Name,Address,Last Purchase Date,Purchase Amount,Comment,Return Customer
1,Don,Knots,"123 Main St.,
Duggietown, ET 12342",10/23/2013,23.43,"""Doesn't like cheese"" according to his mom.",Y
2,Cher,Vega,"92 Victor Ln.
Rutrow, DA 39252",01/12/2013,588.1,,N
3,Tina,Ray,"1111 Yomdip Circle
Bribloop, EV 92341",02/03/2013,234.2,,Y
4,Charlie,Bucket,"745 Caca Pl.
Hastiville, JS 92293",05/06/2013,345.4,,N

Below is an example of processing the above CSV first as an array of arrays, then as an array of dictionaries (objects), and lastly as an array of dictionaries with typed values:

As you can see from the jPaq Proof above, this parser works well with the majority of the CSVs that you would need to process. Still, in the case that you need a fully-fledged CSV parser, Papa Parse seems to be a pretty good solution. Have fun! 😎
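To make the "array of dictionaries keyed on the values in the first row" idea concrete, here is a minimal sketch. It is not the annotated function from this post: rowsToObjects is a hypothetical helper name, and it assumes the CSV has already been parsed into an array of arrays whose first row holds the column names.

```javascript
// Hypothetical helper (not the parser from this post): converts an array of
// arrays into an array of objects keyed on the values in the first row.
function rowsToObjects(rows) {
  var header = rows[0];
  return rows.slice(1).map(function(row) {
    var obj = {};
    header.forEach(function(key, i) {
      obj[key] = row[i];
    });
    return obj;
  });
}

// Sample data: a shortened version of the customer CSV above.
var rows = [
  ['ID', 'First Name', 'Last Name'],
  ['1', 'Don', 'Knots'],
  ['2', 'Cher', 'Vega']
];
var customers = rowsToObjects(rows);
console.log(customers[0]['First Name']); // "Don"
```

All values stay strings in this sketch. Converting them to numbers, dates or booleans would be a separate step.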

Regular Expressions – Extra Ending Match

A few days ago I was working on a very primitive version of a CSV reader for a quick JS project and while testing my regex out, I noticed that I was getting an extra match at the end. Here is the JavaScript function that I had:

function parseCSV(str) {
  var row = [], rows = [row];
  str.replace(/([^",\r\n]*|"((?:[^"]+|"")*)")(,|\r|\r?\n|$)/g, function(match, cell, quoted, delimiter) {
    row.push(quoted ? quoted.replace(/""/g, '"') : cell);
    if (delimiter && delimiter != ',') {
      rows.push(row = []);
    }
  });
  return rows;
}

Interestingly, if you pass "Name,DOB" into parseCSV() an extra cell will be added:

> parseCSV('Name,DOB')
[["Name", "DOB", ""]]

Without diving too deeply into why the function should work given a good regex, you will notice that finding all of the matches for my CSV parsing regex produces an interesting result:

> 'Name,DOB'.match(/([^",\r\n]*|"((?:[^"]+|"")*)")(,|\r|\r?\n|$)/g)
["Name,", "DOB", ""]

After thoroughly analyzing my regex, I started to think there was something wrong with the JS implementation of the regular expression engine. I also thought there might be something wrong with my regular expression so I made a simpler one and saw the following:

> 'hello,world'.match(/[^,]*(?:,|$)/g)
["hello,", "world", ""]

That seems pretty strange, right? I continued investigating and asked around the office to see if others thought it was weird and they agreed. Therefore I decided to try it out in Python to see if there was just a peculiarity in the JS engine:

>>> import re
>>> re.findall('[^,]*(?:,|$)', 'hello,world')
['hello,', 'world', '']

Finally, I started thinking about how I would create a regular expression engine that would look for all instances of the empty string and still not end up in an infinite loop. I did this because I knew that running the following doesn’t result in an infinite loop in Python:

>>> import re
>>> re.findall('', 'Yo')
['', '', '']

I tried the same thing in JS:

> 'Yo'.match(/(?:)/g)
["", "", ""]

It seems that if the matched substring has a length of zero, the next search will start one character past the match’s starting/ending index. On the other hand, if the match contains at least one character, the next search will start at the index after the last character matched. Therefore, let’s consider my simplified regular expression again:

> 'hello,world'.match(/[^,]*(?:,|$)/g)
["hello,", "world", ""]
  1. The first time the regex engine tries to find a match, it uses a greedy search to find zero or more non-comma characters followed by a comma or the end of the string.
  2. It finds "hello," (including the comma) and the end index is 6.
  3. Since the last match was not the empty string, the regex engine doesn’t advance the starting position of the next search and simply evaluates against "world".
  4. It finds "world" and the end index is 11 (5 plus the offset of 6).
  5. Again, since the last match was not the empty string, the regex engine doesn’t advance the starting position of the next search and simply evaluates against the empty string at index 11.
  6. Since the empty string matches the regular expression, the third string found is the empty string and the ending index remains 11.
  7. Finally, the regex engine checks whether the previous match was the empty string. Since it was, it tries to advance the starting index by one, but that is outside the bounds of the string, so there is no need to continue.
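
The steps above can be sketched as a manual loop around RegExp.prototype.exec(). This is only an illustration of the advancing rule, not how any engine is actually implemented. findAll is a hypothetical helper name, and it assumes the regex has the g flag and no u flag (so stepping forward by one index is enough).

```javascript
// Hypothetical helper: collects every match along with its start and end
// index, stepping past empty matches by hand so the loop terminates.
function findAll(re, str) {
  if (!re.global) {
    throw new Error('findAll() expects a regex with the g flag');
  }
  var results = [], m;
  re.lastIndex = 0;
  while ((m = re.exec(str)) !== null) {
    results.push({ text: m[0], start: m.index, end: re.lastIndex });
    if (m[0] === '') {
      // Empty match: exec() leaves lastIndex where it was, so move it
      // forward one position or the same empty match repeats forever.
      re.lastIndex++;
    }
  }
  return results;
}

var found = findAll(/[^,]*(?:,|$)/g, 'hello,world');
console.log(found);
// [ { text: 'hello,', start: 0, end: 6 },
//   { text: 'world', start: 6, end: 11 },
//   { text: '', start: 11, end: 11 } ]
```

Once lastIndex is past the end of the string, exec() returns null and the loop stops, which corresponds to step 7.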

The steps listed above simply walk through why regex engines find three matches when /[^,]*(?:,|$)/g is run against "hello,world". If I wanted a similar regex that doesn’t allow the empty string at the end, I could use /(?!$)[^,]*(?:,|$)/g. In conclusion, even though I thought I knew all of the strange edge cases for regexes in JS, I found that I still have more to learn! 😎
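
As a quick check of the lookahead variant mentioned above, here is a small comparison. The variable names are only for illustration. Note the trade-off in the last case: when the input really does end with a comma, the lookahead also drops the legitimate empty field at the end.

```javascript
// Without the lookahead: the empty match at the end of the string shows up.
var withExtra = 'hello,world'.match(/[^,]*(?:,|$)/g);
console.log(withExtra); // ["hello,", "world", ""]

// With (?!$): no match may start at the end of the string.
var withoutExtra = 'hello,world'.match(/(?!$)[^,]*(?:,|$)/g);
console.log(withoutExtra); // ["hello,", "world"]

// Caveat: a trailing comma implies an empty last field, which is lost too.
var trailingComma = 'hello,'.match(/(?!$)[^,]*(?:,|$)/g);
console.log(trailingComma); // ["hello,"]
```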

JavaScript Snippet – Undo Camel Case

Probably due to it being so late, I was looking for code to uncamelize (undo camel-casing) any string. I came across what claimed to be a solution in PHP, but unfortunately it did nothing but lowercase my string. Therefore I decided to write my own solution:

function uncamelize(s) {
  return s.replace(/[A-Z]/g, '_$&').toLowerCase();
}

Believe it or not, the solution is that simple. Here is an example of using it:

An interesting thing to note about this uncamelize implementation is that it uses the $& pattern to reuse the substring matched by the regular expression. Even though this pattern is not commonly used, it is documented as shown here.
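
A few sample calls may help show what $& does in practice. The inputs below are my own picks for illustration, and the function is repeated so the snippet runs on its own. Two edge cases are worth knowing about: a leading capital produces a leading underscore, and runs of capitals are split letter by letter.

```javascript
function uncamelize(s) {
  // '_$&' inserts an underscore in front of each matched uppercase letter.
  return s.replace(/[A-Z]/g, '_$&').toLowerCase();
}

console.log(uncamelize('helloWorld')); // "hello_world"
console.log(uncamelize('HelloWorld')); // "_hello_world"
console.log(uncamelize('parseCSV'));   // "parse_c_s_v"
```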