Regular Expressions – Extra Ending Match

A few days ago I was working on a very primitive version of a CSV reader for a quick JS project and while testing my regex out, I noticed that I was getting an extra match at the end. Here is the JavaScript function that I had:

function parseCSV(str) {
  var row = [], rows = [row];
  str.replace(/([^",\r\n]*|"((?:[^"]+|"")*)")(,|\r|\r?\n|$)/g, function(match, cell, quoted, delimiter) {
    row.push(quoted ? quoted.replace(/""/g, '"') : cell);
    if (delimiter && delimiter != ',') {
      rows.push(row = []);
    }
  });
  return rows;
}

Interestingly, if you pass "Name,DOB" into parseCSV() an extra cell will be added:

> parseCSV('Name,DOB')
[["Name", "DOB", ""]]

Without diving into the function too much to see why it should work with a good regex, you will notice that finding all of the matches for my CSV parsing regex produces an interesting result:

> 'Name,DOB'.match(/([^",\r\n]*|"((?:[^"]+|"")*)")(,|\r|\r?\n|$)/g)
["Name", "DOB", ""]

After thoroughly analyzing my regex, I started to think there was something wrong with the JS implementation of the regular expression engine. I also thought there might be something wrong with my regular expression so I made a simpler one and saw the following:

> 'hello,world'.match(/[^,]*(?:,|$)/g)
["hello", "world", ""]

That seems pretty strange, right? I continued investigating and asked around the office to see if others thought it was weird and they agreed. Therefore I decided to try it out in Python to see if there was just a peculiarity in the JS engine:

>>> import re
>>> re.findall('[^,]*(?:,|$)', 'hello,world')
['hello', 'world', '']

Finally, I started thinking about how I would create a regular expression engine that would look for all instances of the empty string and still not result in an infinite loop. The reason I did this was because I knew that running the following doesn’t result in an infinite loop in Python:

>>> import re
>>> re.findall('', 'Yo')
['', '', '']

I tried the same thing in JS:

> 'Yo'.match(/(?:)/g)
["", "", ""]

It seems that if the matched substring has a length of zero, the next search will start one character past the match’s starting/ending index. On the other hand, if the match contains at least one character, the next search will start at the index after the last character matched. Therefore, let’s consider my simplified regular expression again:

> 'hello,world'.match(/[^,]*(?:,|$)/g)
["hello", "world", ""]
  1. The first time the regex engine tries to find a match it uses a greedy search to find zero or more word characters right before a comma or the end of the string.
  2. It finds hello, and the end index is 6.
  3. Now since the last match was not the empty string the regex engine doesn’t try to advance the starting position of the next search and simply evaluates against "world".
  4. It finds world and the end index is 11 (5 plus the offset of 6).
  5. Again since the last match was not the empty string the regex engine doesn’t try to advance the starting position of the next search simply evaluates against the empty string.
  6. Since the empty string matches the regular expression, the third string found is the empty string and the ending index remains 11.
  7. Finally the regex engine looks to see if the previous match was the empty string and since it was, it tries to advance the starting index by one but realizes that is outside the bounds of the string and therefore there is no need to continue.

The steps listed above are simply used to validate the reasoning behind why regex engines would find three matches for matching /[^,]*(,|$)/g against "hello,world". In the event that I would want to use a similar regex which doesn’t allow the empty string at the end, I could use /(?!$)[^,]*(?:,|$)/g. In conclusion, even though I thought I knew all of the strange edge cases for regexes in JS, I found that I still have more to learn! 😎

Python – List Comprehension of Updated Dictionaries

First of all, i’m sure that the title is a bit confusing, but I couldn’t think of a good way to state in a few words the scope of a solution I was looking for today. Basically, I had a list of dictionaries and I wanted to create a new list of dictionaries with an additional key-value pair. For example, let’s say I have a list of dictionaries which define a and b:

my_list = [
    { 'a': 1, 'b': 2 },
    { 'a': 2, 'b': 3 },
    { 'a': 3, 'b': 4 }
]

What I wanted to do was create a new list containing copies of the dictionaries with a new key-value pair. After finding this stackoverflow answer I came up with a solution similar to the following:

import math
list1 = [
    {'a': 1, 'b': 2},
    {'a': 2, 'b': 3},
    {'a': 3, 'b': 4}
]
list2 = [dict(d.items() + [('c', math.hypot(*d.values()))]) for d in list1]
# [
#     {'a': 1, 'c': 2.23606797749979, 'b': 2},
#     {'a': 2, 'c': 3.605551275463989, 'b': 3},
#     {'a': 3, 'c': 5.0, 'b': 4}
# ]

Cool stuff, right? Do you know what is actually happening? d.items() actually turns each key-value pair into a tuple and returns them all as a list. Then we are concatenating our tuple to make a new list of three tuples. In this case doing math.hypot(*args) actually passes each item in the list as a separate argument (here is more information about using an apply-like function in Python). After that dict(...) takes that list of tuples and creates a dictionary from them. Finally, this is done for each dictionary in list1, add the new dictionary to a new list which is assigned to list2.

How cool is that?!?!?! Even though I still am a big fan of JavaScript, the fact that you can write this type of code in Python is pretty inspiring! 😎

Python – Convert Excel Column Name To Number

A few days I wrote a post about how to convert an cell address into coordinates (column number, and row number). I am working a quick Python script for work to import CSV data into a DB. In order to allow the user the ability to enter either the column number or the column name I used the following function to convert column names into number:

def colNameToNum(name):
    pow = 1
    colNum = 0
    for letter in name[::-1]:
            colNum += (int(letter, 36) -9) * pow
            pow *= 26
    return colNum

As you can see, all you have to do is pass it the name of the column (letters) and the correspond column number will be returned. Have fun! 😎