Maja Ramljak mramljak at edu.uwaterloo.ca
Fri May 24 10:56:31 EDT 2019

I have a question related to having a column of lists in my dataset. For instance a row of my data looks something like:

Click.RT NameOfFix TimeOfFix
3.325 [‘Cat’, ‘Dog’, ‘Mouse’] [0.89, 1.22, 2.64]

I’m experimenting and trying to create a new column that stores the length of these string lists. I have this function:

listfxn <- function(x){
    for (i in alldata[[x]])
    {
    fxn <- unlist(str_split(i, boundary("word")))
    }
    length <-length(fxn)
    return(length) }

So that if I call listfxn(“NameOfFix”), with “NameOfFix” being the column name, it will return a number indicating the length of the list. Due to the way I designed my function, I assume that it’s calculating the length of each row, but only returning the value of the last row in the data.

If I try to call the function for each row of the column with something like:

alldata[,.(listfxn("Outer.Options")),by=seq(nrow(alldata))]

it takes a very long time for R to process (long enough that I get impatient and stop it from running). Is this due to the nature of my function in that it’s inefficient? Or am I calling the function on all the rows of my data incorrectly?

Another way to accomplish something similar is with gsub. For example:

test <- unlist(strsplit(gsub("\\[|\\]|\\'",'',alldata[1,NameOfFix]),','))

and when I call test, it will return “Cat” “Dog” “Mouse”. If I call length(test), it will return 3. I’m unsure if creating a function with this method would be more efficient.
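
One way the gsub approach could be wrapped into a function is sketched below. It is a minimal base R sketch, not a tested solution for the real data: the name countItems and the toy strings are hypothetical, and it assumes every entry is a Python-style list of single-quoted strings with no commas inside the items.

```r
# Hypothetical helper: count the items in each Python-style list string.
countItems <- function(column) {
    # Strip square brackets, single quotes, and whitespace from every string.
    stripped <- gsub("\\[|\\]|'|\\s", "", as.character(column))
    # Split each cleaned string on commas; the result is a list of vectors.
    pieces <- strsplit(stripped, ",", fixed = TRUE)
    # lengths() gives the length of every list element in a single call.
    lengths(pieces)
}

# Toy data standing in for a column such as NameOfFix.
example <- c("['Cat', 'Dog', 'Mouse']", "['Key']", "[]")
counts <- countItems(example)
print(counts)
```

Because gsub, strsplit, and lengths all work on whole vectors, the function takes the entire column at once and needs no explicit loop over rows.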

Any ideas/suggestions would be appreciated.

Maja


Peter Anthony Victor Diberardino pavdiberardino at edu.uwaterloo.ca
Fri May 24 11:12:19 EDT 2019

Can you send your whole data.table containing these columns? Peter


Britt Anderson britt at uwaterloo.ca
Fri May 24 11:32:03 EDT 2019

I had to reject Maja’s response because attaching the whole data set was too large for our mailing group limit. Perhaps Maja could put the data somewhere on a lab computer (maybe dataRepos on brittlab4), and send out the location and then we could access it directly. /Britt


Peter Anthony Victor Diberardino pavdiberardino at edu.uwaterloo.ca
Fri May 24 11:37:03 EDT 2019

I received Maja’s attachment and put it here: /usr/local/lab/dataRepos/majaData.data. The ‘name’ of the table will still be alldata if you load it into R.

Peter


Maja Ramljak mramljak at edu.uwaterloo.ca
Fri May 24 11:39:33 EDT 2019

Also I’ve cut down the original data so that it’s a 5x30 table (a few columns and a few rows) that represents more clearly what I explained in my first email. It should be small enough to be sent as an attachment in case anyone wanted to access it through email instead of through brittlab4.

Maja


Peter Anthony Victor Diberardino pavdiberardino at edu.uwaterloo.ca
Fri May 24 12:40:04 EDT 2019

Here is a slightly different way of doing what you have now. Admittedly more piecemeal.

Focusing on one individual list at a time, I noticed that they were stored in some odd Python-list/factor class, which makes them difficult to work with. It seems you figured something out with unlist(str_split(i, boundary("word"))). Here is another way to do it, which may contain useful tools for others in the future:

lengthElement <- function(element){
    # This changes each element from an odd Python-list object into a simple
    # character vector.
    # Note: the square brackets, commas, and "nopic" strings are still members
    # of this new character vector.
    # The useful tool here is strsplit, which splits the long string into pieces
    # wherever a single quote occurs (written as "\'" below).
    # Then we unlist to flatten the result into a one-dimensional character vector.
    stringElement <- unlist(strsplit(as.character(element), split="\'"))
    
    # We can remove the unwanted strings.
    # The important tool here is subsetting: we keep the elements that are
    # not in the vector of strings we don't want. In this case, those are
    # the square brackets and the commas, plus "nopic" for this specific
    # application. ", " is listed as well, in case the separators in the data
    # carry a trailing space.
    cleanElement <- stringElement[!stringElement %in% c("[",",",", ","]","nopic")]
    
    # Now return the length.
    return(length(cleanElement))
} 

Now we know how to turn a single Python list in R into a usable character vector. To apply the function above to each element of a data.table column, we can use a function from the apply family. In this case I chose sapply, as it returns a 1D vector of the same length as the input column:

colListLength <- function(column){
    return(sapply(X = column, FUN=lengthElement))
}

This returns a vector the same length as column, where each element is the length of the list (excluding “nopic”) in the corresponding element of column.

You could do this for every relevant column.
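
As a rough illustration of applying this to several columns at once, here is a self-contained sketch. The toy data frame and its values are made up, the functions are restated so the block runs on its own, vapply is swapped in for sapply so the result is guaranteed to be an integer vector, and the ", " entry in the exclusion vector is an assumption about how the separators are spaced in the real data.

```r
# Restated so this block runs on its own.
lengthElement <- function(element) {
    parts <- unlist(strsplit(as.character(element), split = "'"))
    # ", " is included on the assumption that separators carry a space.
    length(parts[!parts %in% c("[", ",", ", ", "]", "nopic")])
}

# vapply (instead of sapply) always returns an integer vector here.
colListLength <- function(column) {
    vapply(column, lengthElement, integer(1), USE.NAMES = FALSE)
}

# Hypothetical toy data; factors mimic how the real columns were read in.
toy <- data.frame(
    NameOfFix = c("['Cat', 'Dog', 'Mouse']", "['Key', 'nopic', 'Dog']"),
    Outer.Options = c("['Cat', 'Key']", "['nopic']"),
    stringsAsFactors = TRUE
)

# Add one count column per list column, prefixed with "N.".
listCols <- c("NameOfFix", "Outer.Options")
toy[paste0("N.", listCols)] <- lapply(toy[listCols], colListLength)
print(toy)
```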

Peter


Sean Griffin sean.griffin at uwaterloo.ca
Fri May 24 22:55:03 EDT 2019

In the spirit of one-liners and because the perl regexp has been critical for the eye-tracker data analysis, here’s another approach.

alldata[, N.Fix := lapply(as.character(NameOfFix), function(x) length(gregexpr("'(?!nopic)\\w", x, perl=T)[[1]]))]

Notes:

0) Ends up adding a new column to the data.table called N.Fix which contains the integer count of the “list of strings” in NameOfFix.

1) as.character(NameOfFix) converts the column’s values so the next steps work, but without permanently altering that column’s class in the actual data.table. I didn’t want any surprises if other operations in your script expect the original class.

2) Used a “lambda” function in lapply because: A) gregexpr takes as its first argument a string representing the regular expression rather than the text within which you intend to search for a match, and lapply is providing the latter, not the former. B) We get to compose the two other steps we need to take to get from the result of calling gregexpr to the count we want.

3) gregexpr returns every match rather than only the first, unlike its friends (see ?gregexpr in R). Setting perl=T (default is F) enables us to use (?! …) in our match pattern. The match pattern says: find substrings whose first character is a single quotation mark and whose second character is a word character (a letter, digit, or underscore), provided that the text immediately following the quotation mark is not the substring nopic.

4) Indexing into the result using [[1]] gives us a vector with class == integer with each entry corresponding to the character index of the match in the searched text. Taking its length returns the count.

5) Not sure using semi-colons means this isn’t a one-liner anymore but the following version accounts for the possibility that the entry in NameOfFix doesn’t return any matches whatsoever (so -1 comes into play) and returns a count of 0 in those cases.

alldata[, N.Fix := lapply(as.character(NameOfFix), function(x) {result = gregexpr("'(?!nopic)\\w", x, perl=T)[[1]];if (!result[1]==-1) {length(result)} else {0}})]
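
To make notes 3) to 5) concrete, the sketch below runs the same pattern on two made-up strings in base R, without data.table. The helper name countFix is hypothetical. It uses vapply rather than lapply, which is a substitution on my part: vapply returns a plain integer vector, whereas lapply returns a list.

```r
pattern <- "'(?!nopic)\\w"

# A string with three defined fixations and two 'nopic' entries.
withHits <- "['Key', 'nopic', 'Dog', 'nopic', 'Dog']"
hits <- gregexpr(pattern, withHits, perl = TRUE)[[1]]
nHits <- length(hits)          # one entry per opening quote that matched

# A string with no defined fixations: gregexpr signals "no match" with -1,
# so taking length() directly would wrongly report 1.
noHits <- "['nopic', 'nopic']"
miss <- gregexpr(pattern, noHits, perl = TRUE)[[1]]
naiveCount <- length(miss)

# Hypothetical helper that treats the -1 sentinel as a count of zero.
countFix <- function(x) {
    result <- gregexpr(pattern, x, perl = TRUE)[[1]]
    if (result[1] == -1) 0L else length(result)
}

counts <- vapply(c(withHits, noHits), countFix, integer(1), USE.NAMES = FALSE)
print(counts)
```

Closing quotes never match because they are followed by a comma or a bracket rather than a word character.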


Maja Ramljak mramljak at edu.uwaterloo.ca
Sun May 26 19:30 EDT 2019

Thanks to everyone who sent back a solution - I really appreciate you taking the time to not only read my question but to send something back. I took pointers from the solutions and also implemented them on my dataset, which was nice in that I was able to compare the columns I was generating to the columns generated by your solutions.

If anyone is interested, the reason my original function failed was an error in my for loop: when I called listfxn, it calculated the length for every row of the column of lists but returned only the last length, which was then (incorrectly) applied to all the rows of my new column.
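
The loop behaviour can be reproduced on toy data with the sketch below. It is base R only: splitWords is a hypothetical stand-in for the str_split(i, boundary("word")) step, and the three strings are made up. The same structure would also account for the slowness, since calling the function once per row with by=seq(nrow(alldata)) re-processes the whole column every time.

```r
rows <- c("['Cat', 'Dog', 'Mouse']", "['Key', 'Dog']", "['Cat']")

# Hypothetical stand-in for the word-splitting step, using base R only.
splitWords <- function(s) regmatches(s, gregexpr("[A-Za-z0-9]+", s))[[1]]

# Loop in the style of the original: `words` is overwritten on every pass,
# so only the last row survives to the length() call.
lastOnly <- function(column) {
    for (i in column) {
        words <- splitWords(i)
    }
    length(words)
}

# Collecting one length per row instead of overwriting.
perRow <- function(column) {
    vapply(column, function(s) length(splitWords(s)), integer(1),
           USE.NAMES = FALSE)
}

last <- lastOnly(rows)
each <- perRow(rows)
print(last)
print(each)
```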

One more issue: in rows from the image condition, the pictures were reported as jpgs (e.g. AnHa01.jpg), and unlist(str_split(x, boundary("word"))) reported “jpg” as a word on its own. Sean and Peter’s solutions avoided this bug.

Also, since a list of the names of fixations for one trial might look something like:

[‘Key’, ‘nopic’, ‘Dog’, ‘nopic’, ‘Dog’, ‘nopic’, ‘Dog’]

with ‘nopic’ representing a fixation somewhere on the screen undefined by experimental stimuli (i.e. a fixation on a blank portion of the screen), we remove the ‘nopic’ entries to get [‘Key’, ‘Dog’, ‘Dog’, ‘Dog’]. The length of this list tells us that 4 defined fixations occurred. Depending on the analysis you’re doing and the assumptions you’re making, it might be beneficial to count only the unique fixations that occurred, or something along those lines.

I did this with the following (plagiarizing a bit from Peter’s function):

uniqueList <- function(x){

   fxn <- unlist(str_split(x,boundary("word")))

   clean <- fxn[!fxn %in% c("[",",","]", "jpg", "nopic")]

   uniqueClean <- unique(clean)

   listlen <- length(uniqueClean)

   return(listlen)

}

Just a thought for anyone working with eyetracking data in the future.

Maja


Peter Anthony Victor Diberardino pavdiberardino at edu.uwaterloo.ca
Tue May 28 14:13:12 EDT 2019

Glad something worked out. Separate from your original question, the solution you provided is a great case for demonstrating the concept of ‘piping’ using the R package ‘magrittr’. Your original solution:

uniqueList <- function(x){
   fxn <- unlist(str_split(x,boundary("word")))
   clean <- fxn[!fxn %in% c("[",",","]", "jpg", "nopic")]
   uniqueClean <- unique(clean)
   listlen <- length(uniqueClean)
   return(listlen)
}

Notice how a new variable is defined on each line, where each variable is a transformation of the variable from the line above. This line-by-line programming is very explicit and easy to read (especially with your informative naming convention). But sometimes the step taken on a given line is clear from the operation itself, making the new variable name redundant. For example, uniqueClean <- unique(clean).

To avoid this, you can use a technique called piping. Piping takes the object from the line above and sends it directly to the next line, removing the need to define a new variable. There are two key expressions you need to know for simple use of magrittr piping: “%>%”, which goes at the end of a line to signal that you’re piping that result to the next line, and “.”, which is used within the next line as a placeholder for the result piped from the line above.

Applying these concepts to Maja’s uniqueList function, we get:

uniqueListPiped <- function(x){
    listlen <- unlist(str_split(x,boundary("word"))) %>%
        .[!. %in% c("[",",","]", "jpg", "nopic")] %>%
        unique(.) %>%
        length(.)
    return(listlen)
}

Notice how we only need to come up with one variable name (describing our intended final result), while each intermediate step remains separated line by line.
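
For a quick check that piping changes only the layout and not the result, here is a small sketch on a made-up vector. It assumes the magrittr package is installed, and it skips the str_split step so that stringr is not needed.

```r
library(magrittr)  # assumed to be installed

# Toy fixation names, already split into a character vector.
x <- c("Key", "nopic", "Dog", "nopic", "Dog")

# Nested calls: correct, but they read from the inside out.
nested <- length(unique(x[!x %in% "nopic"]))

# Piped calls: the same steps, read top to bottom.
piped <- x %>%
    .[!. %in% "nopic"] %>%
    unique(.) %>%
    length(.)

print(c(nested = nested, piped = piped))
```

Both versions count the unique names that are not "nopic", so they agree.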

You can read more about the magrittr package here:

https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html

Peter