Peter Anthony Victor Diberardino pavdiberardino at edu.uwaterloo.ca
Thu Mar 14 12:26:19 EDT 2019

I recently discovered a useful programming principle while working on the Plinko analysis. I am sharing it here to document my findings, and hopefully save others from future headaches of this sort.

When working with data.tables in R, or any other structured data type, it is important to make as few assumptions about the ‘state’ of the data as possible. Concretely, try to rely on the defined properties of a database, rather than the way in which that data may be presented.

Consider the following table:

Table 1

ID Trial Condition
1 1 a
1 2 a
1 3 b
1 4 b
2 1 a
2 2 a
2 3 b
2 4 b

The inherent properties of this table require that IDs are in the ID column, a/b values are in the Condition column, and so on. We also cannot ‘shuffle’ one column at a time: each row corresponds to one particular trial, so its entries must stay together.

However, as long as we keep each row intact, the rows can be reordered in any way, and the table will remain valid and represent the same data as it did before.

This table is equivalent to the first table:

Table 2

ID Trial Condition
2 2 a
1 2 a
1 4 b
2 1 a
1 3 b
1 1 a
2 4 b
2 3 b

Due to this property, any analysis we do on tables of data should not rely on any particular ordering of the rows.

For example, say we would like to add a column to indicate when a condition, or the participant, switches. This can be done using the shift() function from data.table in R, which returns the entries of the row above so that they can be compared with the entries of the current row. We will call this column ‘Switch’.

NEW Table 1

ID Trial Condition Switch
1 1 a NA
1 2 a FALSE
1 3 b TRUE
1 4 b FALSE
2 1 a TRUE
2 2 a FALSE
2 3 b TRUE
2 4 b FALSE

This works as expected when all rows are in this particular order.
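
A minimal sketch of how such a column might be built, assuming the data.table package and a table named dt (the name is only for illustration). The first row has no row above it, so its comparison comes out as NA:

```r
library(data.table)

# Table 1, with rows in ID/Trial order
dt <- data.table(
  ID        = rep(1:2, each = 4),
  Trial     = rep(1:4, times = 2),
  Condition = rep(c("a", "a", "b", "b"), times = 2)
)

# shift() lags a column by one row, so each row is compared with
# whatever row currently sits directly above it
dt[, Switch := ID != shift(ID) | Condition != shift(Condition)]
print(dt)
```

Note that nothing in this expression mentions Trial; the result depends only on the physical position of the rows.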

But if we made this column when the table was in a state like Table 2:

NEW Table 2

ID Trial Condition Switch
2 2 a NA
1 2 a TRUE
1 4 b TRUE
2 1 a TRUE
1 3 b TRUE
1 1 a TRUE
2 4 b TRUE
2 3 b FALSE

Notice the Switch values for each row do not correspond with what we would have expected if the order was as Table 1. This is an example of a solution that varies based on the ordering of the data table rows.
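
To make the mismatch concrete, here is a rough sketch that applies the same shift() comparison to both row orders and then lines the results up by (ID, Trial). The names t1, t2 and cmp are only illustrative:

```r
library(data.table)

t1 <- data.table(
  ID        = rep(1:2, each = 4),
  Trial     = rep(1:4, times = 2),
  Condition = rep(c("a", "a", "b", "b"), times = 2)
)
# The same rows, rearranged into the order shown in Table 2
t2 <- t1[c(6, 2, 4, 5, 3, 1, 8, 7)]

# Identical expression applied to both orderings
t1[, Switch := ID != shift(ID) | Condition != shift(Condition)]
t2[, Switch := ID != shift(ID) | Condition != shift(Condition)]

# Match rows by their values to compare the two Switch columns
cmp <- merge(t1, t2, by = c("ID", "Trial", "Condition"),
             suffixes = c(".ordered", ".shuffled"))
print(cmp)
```

The two Switch columns in cmp disagree for most trials, even though both tables hold exactly the same data.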

This is important to note. The program you are using (like R) may not present your data in the order you expect. If your data set is too large or complex for you to notice, you could be making assumptions about your data that are not true. If your solution depends on the ordering, this may lead to unexpected results and a lot of time spent figuring out what went wrong.

An alternative solution:

Instead of making an indicator column that would vary based on order, we want a solution that is order invariant. That is, a solution that would give us the same results regardless of how our data is presented.

Let’s create a new column called ‘TrialSet’ that gives a unique name for the set of trials we are interested in. In this case we care about ID and condition, so we can make the values in this new column the concatenation of the values within the ID and Condition columns. This can be done with paste() in R.

GOOD Table 1

ID Trial Condition TrialSet
1 1 a 1a
1 2 a 1a
1 3 b 1b
1 4 b 1b
2 1 a 2a
2 2 a 2a
2 3 b 2b
2 4 b 2b

GOOD Table 2

ID Trial Condition TrialSet
2 2 a 2a
1 2 a 1a
1 4 b 1b
2 1 a 2a
1 3 b 1b
1 1 a 1a
2 4 b 2b
2 3 b 2b

These two tables are equivalent, as desired.
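
As a small sketch of the paste() approach (t1 and t2 are illustrative names for the two row orders), the same line of code can be run on either ordering and gives the same TrialSet value for every trial:

```r
library(data.table)

t1 <- data.table(
  ID        = rep(1:2, each = 4),
  Trial     = rep(1:4, times = 2),
  Condition = rep(c("a", "a", "b", "b"), times = 2)
)
# The same rows, rearranged into the order shown in Table 2
t2 <- t1[c(6, 2, 4, 5, 3, 1, 8, 7)]

# TrialSet only uses values from within the current row
t1[, TrialSet := paste(ID, Condition, sep = "")]
t2[, TrialSet := paste(ID, Condition, sep = "")]

# Once the rows are lined up by (ID, Trial), the labels agree
same <- identical(t1[order(ID, Trial), TrialSet],
                  t2[order(ID, Trial), TrialSet])
print(same)
```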

This solution is order invariant because the entries for TrialSet are defined only by values from within a specific row. No matter the ordering of the rows, these values are accessed in the same way. In contrast, the Switch column was not defined on a particular row. It was defined by whatever row happened to be above it.

In summary, when working with tables of data, either enforce a particular order for your rows or, even better, design solutions that are order invariant. Not doing so could produce unexpected outputs that interfere with your analysis.

If you have any questions about this matter, or would like to hear about the specific context in which this issue arose during the Plinko analysis, send me an email.

Don’t forget to send any of your programming questions or discoveries to the code forum!

Peter

PS. Today is pi day. If you like pie (or pi), head to MC 3rd floor at 3:14 to get some for free.


Britt Anderson britt at uwaterloo.ca
Thu Mar 14 12:45:52 EDT 2019

The two examples you give don’t seem exactly comparable since the /shift/ and /paste/ commands are being done inter-row and intra-row. What would the data.table command look like for the order-invariant approach to the “switch” question you solved with /shift/?

Can we do this with a vectorized data table implementation, or would we have to iterate over all trials to be truly order invariant? Sounds like it could be a brain teaser for those in the lab with some R experience.

This might also be useful to others, right Maja?

P.S. Thanks for sharing this Peter, and promoting the use of the lab mailing list for this sort of thing. Getting this kind of discussion shared and archived helps us all. More minds focused on common problems leads to better solutions, and more chances to record solutions means fewer times we need to reinvent them.


Peter Anthony Victor Diberardino pavdiberardino at edu.uwaterloo.ca
Thu Mar 14 17:45:07 EDT 2019

That is correct, the two methods I presented are not exactly comparable in terms of inter-row referencing.

The /paste/ solution was to show that you should explicitly reference the rows you want to use based on the data they contain, rather than the ordering of the rows. In this particular case, the explicit row reference was trivial, as it only referenced the current row (an intra-row solution). This logic is not necessarily restricted to intra-row, however. As long as you can explicitly declare the target row, the solution will be order invariant.

For the ‘Switch’ column solution, the script would need to implement the following to be order invariant:

For each ID
  For each Row
    Compare the Condition value of the current row to the Condition value of the row with the same ID and Trial = (Trial of the current row - 1).
    If the values are different, Switch = TRUE
    Else Switch = FALSE

At least something like this. And this may also be done through vectors in R.

Despite this rather straightforward logic, I cannot find a clean or efficient way to implement it in R. So for now I will claim that there is no easy order invariant solution for a Switch column of this sort. (Someone should try to disprove it!)
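
One possible attempt at the logic described above, offered as a sketch rather than a definitive answer: a data.table self-join that looks up the previous trial by value instead of by position. The helper name add_switch() and the temporary column PrevCondition are made up for illustration, and the sketch assumes that Trial is an integer and that each (ID, Trial) pair occurs only once. Unlike the original Switch column, the first trial of each ID comes out as NA here, because no earlier trial exists for that participant.

```r
library(data.table)

# add_switch() and PrevCondition are illustrative names, not from the thread
add_switch <- function(dt) {
  out <- copy(dt)  # leave the caller's table untouched
  # For every row, describe the trial that follows it: same ID, Trial + 1
  prev <- out[, .(ID, Trial = Trial + 1L, PrevCondition = Condition)]
  # Update join: attach the previous trial's Condition by (ID, Trial) value
  out[prev, on = .(ID, Trial), PrevCondition := i.PrevCondition]
  out[, Switch := Condition != PrevCondition]
  out[, PrevCondition := NULL]
  out[]
}

t1 <- data.table(
  ID        = rep(1:2, each = 4),
  Trial     = rep(1:4, times = 2),
  Condition = rep(c("a", "a", "b", "b"), times = 2)
)
t2 <- t1[c(6, 2, 4, 5, 3, 1, 8, 7)]  # same rows in the Table 2 order

r1 <- add_switch(t1)
r2 <- add_switch(t2)

# Both orderings give the same Switch value for every (ID, Trial)
print(identical(r1[order(ID, Trial), Switch], r2[order(ID, Trial), Switch]))
```

Whether this counts as clean or efficient enough for a large data set is a separate question; the join is vectorized, but it does build a second copy of the key columns.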

Where the two methods I presented do compare is in the goal they accomplish.

Presumably the goal of the Switch column would be to demarcate blocks of rows that need to be analyzed separately from other blocks of rows. With a TrialSet column-like solution, these blocks can be extracted easily with split(table, by='TrialSet'). Plus, a unique name is automatically available for each block, since the values in TrialSet uniquely identify each block. I think achieving this same outcome with the Switch column would be much more difficult.
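
A short sketch of that extraction, assuming the table is a data.table called dt (an illustrative name) that already has the TrialSet column:

```r
library(data.table)

dt <- data.table(
  ID        = rep(1:2, each = 4),
  Trial     = rep(1:4, times = 2),
  Condition = rep(c("a", "a", "b", "b"), times = 2)
)
dt[, TrialSet := paste(ID, Condition, sep = "")]

# One data.table per TrialSet value, named after that value
blocks <- split(dt, by = "TrialSet")
print(sort(names(blocks)))
print(blocks[["1b"]])
```

Each element of blocks holds only the rows for one ID/Condition combination, whatever order the rows of dt were in.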

However, if you were only interested in the particular rows where a certain attribute changed (rather than a block of rows), then the Switch solution may be better. In that case, you would just need to ensure that you enforce the order of your data table. It’s not as satisfying as a naturally order invariant solution, but analyzing data quickly, accurately, and painlessly is a higher priority.

Peter


Sean Griffin sean.griffin at uwaterloo.ca
Thu Mar 14 19:44:07 EDT 2019

Interesting / thought-provoking discussion. In terms of using the switch approach, can you turn it into a data.table and use sorting to induce the order you want to assume exists in the data?


Peter Anthony Victor Diberardino pavdiberardino at edu.uwaterloo.ca
Thu Mar 14 21:35:54 EDT 2019

In general, yes, you could just enforce the order. But I’m not sure how easy it would be to keep track of that order when it has to be enforced over multiple columns and the database is very large. I’m also not sure how that order would be preserved if more data is added to the table. If the solution is order invariant, none of these factors need to be considered. I suppose it’s a balance between designing a faster solution and a more robust one, and it depends on the problem.
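
For reference, a minimal sketch of the enforce-the-order approach, assuming data.table's setorder() is called immediately before the shift() step so that the order cannot drift in between. The table name dt is illustrative, and the rows start in the shuffled Table 2 order:

```r
library(data.table)

dt <- data.table(
  ID        = c(2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L),
  Trial     = c(2L, 2L, 4L, 1L, 3L, 1L, 4L, 3L),
  Condition = c("a", "a", "b", "a", "b", "a", "b", "b")
)

# Sort by reference on every column the comparison implicitly relies on
setorder(dt, ID, Trial)
dt[, Switch := ID != shift(ID) | Condition != shift(Condition)]
print(dt)
```

If rows are appended later, the sort would have to be repeated before Switch is recomputed.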