Regular Expressions – Extra Ending Match

A few days ago I was working on a very primitive version of a CSV reader for a quick JS project and while testing my regex out, I noticed that I was getting an extra match at the end. Here is the JavaScript function that I had:

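function parseCSV(str) {
  var row = [], rows = [row];
  // Each match is one cell (bare or double-quoted) plus the delimiter after it.
  // replace() is used only to iterate over the matches; its result is ignored.
  str.replace(/([^",\r\n]*|"((?:[^"]+|"")*)")(,|\r|\r?\n|$)/g, function(match, cell, quoted, delimiter) {
    // Inside a quoted cell, a doubled quote stands for one literal quote.
    row.push(quoted ? quoted.replace(/""/g, '"') : cell);
    // A line break starts a new row; a comma or the end of the input does not.
    if (delimiter && delimiter !== ',') {
      rows.push(row = []);
    }
  });
  return rows;
}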

Interestingly, if you pass "Name,DOB" into parseCSV() an extra cell will be added:

> parseCSV('Name,DOB')
[["Name", "DOB", ""]]

Without diving too deeply into the function itself, you can see where the extra cell comes from by listing every match that my CSV parsing regex finds:

> 'Name,DOB'.match(/([^",\r\n]*|"((?:[^"]+|"")*)")(,|\r|\r?\n|$)/g)
["Name,", "DOB", ""]

After thoroughly analyzing my regex, I started to think there was something wrong with the JS implementation of the regular expression engine. It was also possible that my regular expression itself was at fault, so I wrote a simpler one and saw the following:

> 'hello,world'.match(/[^,]*(?:,|$)/g)
["hello,", "world", ""]

That seems pretty strange, right? I continued investigating and asked around the office to see if others thought it was weird, and they agreed. So I decided to try it out in Python to see whether this was just a peculiarity of the JS engine:

>>> import re
>>> re.findall('[^,]*(?:,|$)', 'hello,world')
['hello,', 'world', '']

Finally, I started thinking about how I would build a regular expression engine that could look for all instances of the empty string without ending up in an infinite loop. I did this because I knew that the following doesn’t loop forever in Python:

>>> import re
>>> re.findall('', 'Yo')
['', '', '']

I tried the same thing in JS:

> 'Yo'.match(/(?:)/g)
["", "", ""]
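To see where those three empty matches actually sit, here is a small sketch that records the offset of every match. The helper name matchOffsets is made up for this illustration, and it assumes the regex has no capturing groups, so that the second argument passed to the replace() callback is the match offset:

```javascript
// Hypothetical helper: collect the starting offset of every match.
// Assumes `regex` is global and has no capturing groups, so the callback
// receives (match, offset, wholeString).
function matchOffsets(regex, str) {
  var offsets = [];
  str.replace(regex, function (match, offset) {
    offsets.push(offset);
    return match; // leave the string unchanged; we only want the offsets
  });
  return offsets;
}

console.log(matchOffsets(/(?:)/g, 'Yo'));                  // [0, 1, 2]
console.log(matchOffsets(/[^,]*(?:,|$)/g, 'hello,world')); // [0, 6, 11]
```

The empty regex matches once before each character and once more at the very end of the string, which is why a two-character string yields three matches.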

It seems that if the matched substring has a length of zero, the next search starts one character past the position where that empty match was found. If, on the other hand, the match contains at least one character, the next search starts at the index right after the last character matched. With that in mind, let’s consider my simplified regular expression again:

> 'hello,world'.match(/[^,]*(?:,|$)/g)
["hello,", "world", ""]

  1. The first time the regex engine tries to find a match, it greedily consumes zero or more non-comma characters that are followed by a comma or the end of the string.
  2. It finds "hello," (the comma is part of the match), and the end index is 6.
  3. Since the last match was not the empty string, the regex engine doesn’t advance the starting position of the next search and simply evaluates against "world".
  4. It finds "world", and the end index is 11 (5 plus the offset of 6).
  5. Again, since the last match was not the empty string, the regex engine doesn’t advance the starting position of the next search and simply evaluates against the empty string that remains at index 11.
  6. Since the empty string matches the regular expression, the third match is the empty string and the ending index remains 11.
  7. Finally, the regex engine checks whether the previous match was the empty string. Since it was, the engine tries to advance the starting index by one, finds that this is outside the bounds of the string, and stops.
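The steps above can be traced by hand with RegExp.prototype.exec and lastIndex. This is only a sketch of the behavior described here, not how any engine is actually implemented, and traceMatches is a hypothetical helper name:

```javascript
// Hypothetical helper: walk through every match of `pattern` in `str`,
// recording where each one starts and ends.
function traceMatches(pattern, str) {
  // Build a fresh global copy so lastIndex starts at 0
  // (any other flags on `pattern` are dropped in this sketch).
  var re = new RegExp(pattern.source, 'g');
  var steps = [], m;
  while ((m = re.exec(str)) !== null) {
    steps.push({ text: m[0], start: m.index, end: m.index + m[0].length });
    // exec() leaves lastIndex at the end of the match. For a zero-length
    // match that is also its start, so bump it or the loop never advances.
    if (m[0].length === 0) {
      re.lastIndex++;
    }
  }
  return steps;
}

traceMatches(/[^,]*(?:,|$)/, 'hello,world').forEach(function (s) {
  console.log(JSON.stringify(s.text) + ' from ' + s.start + ' to ' + s.end);
});
// "hello," from 0 to 6
// "world" from 6 to 11
// "" from 11 to 11
```

After the empty match at index 11, lastIndex is bumped to 12, which is past the end of the string, so the next exec() call returns null and the loop ends.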

The steps listed above explain why regex engines find three matches when /[^,]*(?:,|$)/g is run against "hello,world". If I wanted a similar regex that doesn’t produce the empty match at the end, I could use /(?!$)[^,]*(?:,|$)/g, where the negative lookahead refuses to start a match at the very end of the string. In conclusion, even though I thought I knew all of the strange edge cases for regexes in JS, I found that I still have more to learn! 8-)
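A quick check of the lookahead variant, followed by one possible way to apply the same guard to the CSV function from the start of this post. The name parseCSVGuarded is made up for this sketch, and the only change from parseCSV is the (?!$) prefix:

```javascript
var original = /[^,]*(?:,|$)/g;
var guarded = /(?!$)[^,]*(?:,|$)/g;

console.log('hello,world'.match(original)); // ["hello,", "world", ""]
console.log('hello,world'.match(guarded));  // ["hello,", "world"]
console.log('a,,b'.match(guarded));         // ["a,", ",", "b"]
console.log('a,b,'.match(guarded));         // ["a,", "b,"]

// Hypothetical variant of parseCSV with the same lookahead guard.
function parseCSVGuarded(str) {
  var row = [], rows = [row];
  str.replace(/(?!$)([^",\r\n]*|"((?:[^"]+|"")*)")(,|\r|\r?\n|$)/g, function (match, cell, quoted, delimiter) {
    row.push(quoted ? quoted.replace(/""/g, '"') : cell);
    if (delimiter && delimiter !== ',') {
      rows.push(row = []);
    }
  });
  return rows;
}

console.log(parseCSVGuarded('Name,DOB')); // [["Name", "DOB"]]
```

The trade-off shows up in the 'a,b,' case: empty cells in the middle are still matched, but a genuinely empty cell after a trailing comma is dropped along with the spurious one.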

3 thoughts on “Regular Expressions – Extra Ending Match”

  1. @paulc, that is funny that I didn’t totally think about the practicality of the shorter test regex. I decided to rewrite the simple regex to make it more similar to the CSV parsing one. The reason I don’t use + is because I want to allow for zero characters. Of course, the fixed version allows for zero characters in all positions except the last. Thanks!
