Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Warning: 'Invalid .internal.selfref detected' when adding a column to a data.table returned from a function

This looks like an fread bug, but I am not sure.

This example reproduces my problem. I have a function that reads a data.table and returns it in a list. I use a list so that I can group other results in the same structure. Here is my code:

library(data.table)

ff.fread <- function(){
  dt = fread("x
1
2
")
  list(dt=dt)
}

DT.f <- ff.fread()$dt

Now when I try to add a new column to DT.f, it works but I get a warning message:

DT.f[,y:=1:2]
Warning message:
In `[.data.table`(DT.f, , `:=`(y, 1:2)) :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole
  table so that := can add this new column by reference. At an earlier point,
  this data.table has been copied by R (or been created manually using
  structure() or similar). Avoid key<-, names<- and attr<- which in R currently
  (and oddly) may copy the whole data.table. Use set* syntax instead to avoid
  copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0, list(DT1,DT2) copied
  the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade
  to R>=v3.1.0 if that is biting. If this message doesn't help, please report to
  datatable-help so the root cause can be fixed.

Note that if I create the data.table manually, I don't get this warning. This works fine, for example:

ff <- function(){
  list(dt=data.table(x=1:2))
}
DT <- ff()$dt
DT[,y:=1:2]

It also works fine if I don't return the result of fread within a list:

ff.fread <- function(){
  dt = fread("x
1
2
")
  dt
}

1 Answer


This has nothing to do with fread per se. The cause is that you're calling list() and passing it a named object. We can recreate this with:

require(data.table)
DT <- data.table(x=1:2)       # name the object 'DT'
DT.l <- list(DT=DT)           # create a list containing one data.table
y <- DT.l$DT                  # get back the data.table
y[, bla := 1L]                # now add by reference
# works, but the warning is issued (in R < 3.1.0)

DT.l = list(DT=data.table(x=1:2))   # DT = a call, not a named object
y = DT.l$DT
y[, bla:=1L]
# works fine and no warning message

Good news:

From R 3.1.0 onward (in devel when this answer was written), passing a named object to list() no longer creates a copy. Instead, its reference count (the number of objects pointing to this value) just gets bumped. So the problem goes away once you are on R >= 3.1.0.
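
On a current R this can be checked by comparing addresses. The following is a minimal sketch, assuming R >= 3.1.0 and the exported data.table helper address(); the variable names are illustrative only.

```r
library(data.table)

DT   <- data.table(x = 1:2)
DT.l <- list(DT = DT)    # on R >= 3.1.0 no copy is expected here

# address() returns the memory address of an object as a string
same_object <- identical(address(DT), address(DT.l$DT))
print(same_object)
```

Because both names then refer to the same object, an update by reference through DT.l$DT should also be visible through DT. Wrapping the table in copy() before putting it in the list gives an independent table.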

To understand how data.table detects copies using .internal.selfref, we'll dive into some history of data.table.

First, some history:

You should know that data.table over-allocates column pointer slots on creation (truelength defaulted to 100 in the versions discussed here; the value is controlled by the datatable.alloccol option) so that := can add columns by reference later on. The one issue with this was handling copies. For example, when we call list() and pass it a named object, a copy is made (in R < 3.1.0), as illustrated below.

tracemem(DT)
# [1] "<0x7fe23ac3e6d0>"
DT.list <- list(DT=DT)    # `DT` is the named object on the RHS of = here
# tracemem[0x7fe23ac3e6d0 -> 0x7fe23cd72f48]: 

The problem with any copy of a data.table that R makes (as opposed to data.table's own copy()) is that R internally sets the truelength to 0, even though the truelength() function still appears to return the correct result. This led to a segfault when the copy was updated by reference with :=, because the over-allocation no longer existed (or at least was no longer recognised). This happened in versions < 1.7.8. To overcome it, an attribute called .internal.selfref was introduced. You can inspect this attribute with attributes(DT).
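
As a rough illustration of the over-allocation and the attribute, the sketch below inspects a freshly created table. It only uses the exported truelength() function and base R's attr(); the variable names are made up for the example, and the exact number of allocated slots depends on the data.table version and options.

```r
library(data.table)

DT <- data.table(x = 1:2)

n_cols  <- length(DT)       # columns actually in use
n_slots <- truelength(DT)   # column pointer slots allocated
selfref <- attr(DT, ".internal.selfref")

print(c(used = n_cols, allocated = n_slots))
print(typeof(selfref))      # the attribute is an external pointer
```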

From NEWS (of v1.7.8):

o The 'Chris crash' is fixed. The root cause was that key<- always copies the whole table. The problem with that copy (other than being slower) is that R doesn't maintain the over allocated truelength, but it looks as though it has. key<- was used internally, in particular in merge(). So, adding a column using := after merge() was a memory overwrite, since the over allocated memory wasn't really there after key<-'s copy.

data.tables now have a new attribute .internal.selfref to catch and warn about such copies in future. All internal use of key<- has been replaced with setkey(), or new function setkeyv() which accepts a vector, and do not copy.

What does this .internal.selfref do?

It just points to itself, basically. It's simply an attribute attached to DT that contains the address of DT in RAM. If R inadvertently copies DT, the address of DT will change, but the attribute will still contain the old memory address, so the two won't match any more. data.table checks that they do match (i.e. that the selfref is valid) before adding a new column by reference into a spare column pointer slot.
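
The check can be seen firing on a current R with a copy made by base R's attr<-, one of the functions the warning text names. This is a minimal sketch, assuming attr<- copies a data.table that is bound to two names; DT2, note, moved and msgs are illustrative names, not part of data.table.

```r
library(data.table)

DT  <- data.table(x = 1:2)
DT2 <- DT                       # two names, one object
attr(DT2, "note") <- "tagged"   # base R attr<- makes R copy the shared object

moved <- address(DT2) != address(DT)

# Collect any warning while still letting := finish its work
msgs <- character(0)
invisible(withCallingHandlers(
  DT2[, y := 1:2],
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
))

print(moved)
print(names(DT2))
```

If the selfref check behaves as described, msgs should hold the "Invalid .internal.selfref detected" warning, and the new column should end up in DT2 only.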

How is .internal.selfref implemented?

To understand the .internal.selfref attribute, we have to understand what an external pointer (EXTPTRSXP) is. This page explains it nicely. Copying the essential lines:

External pointer SEXPs are intended to handle references to C structures such as handles, and are used for this purpose in package RODBC for example. They are unusual in their copying semantics in that when an R object is copied, the external pointer object is not duplicated.

They are created as:

SEXP R_MakeExternalPtr(void *p, SEXP tag, SEXP prot);

where p is the pointer (and hence this cannot portably be a function pointer), and tag and prot are references to ordinary R objects which will remain in existence (be protected from garbage collection) for the lifetime of the external pointer object. A useful convention is to use the tag field for some form of type identification and the prot field for protecting the memory that the external pointer represents, if that memory is allocated from the R heap.

In our case, we create the attribute .internal.selfref for DT. Its value is an external pointer to NULL (the address you see in the attribute value), and this external pointer's prot field is another external pointer back to DT (hence the name selfref), whose own prot is set to NULL.

Note: We have to use this strategy (an extptr to NULL whose prot is itself an extptr) so that identical(DT1, DT2) returns TRUE when DT1 and DT2 are two different copies with the same content. (If you don't follow this, you can skip to the next part. It isn't needed to understand the answer to this question.)
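
A small sketch of what that note means in practice, using data.table's own copy() to get a second object with the same content (the variable names are illustrative):

```r
library(data.table)

DT1 <- data.table(x = 1:2)
DT2 <- copy(DT1)   # a separate object with its own selfref

different_objects <- address(DT1) != address(DT2)
same_content      <- identical(DT1, DT2)

print(different_objects)
print(same_content)
```

By default identical() compares two external pointers by the address they hold. Since both selfref attributes hold NULL, they compare equal even though their prot fields point at different tables.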

Okay, how does this all work then?

We know that the external pointer does not get duplicated during a copy. When we create a data.table, the attribute .internal.selfref holds an external pointer to NULL whose prot field is an external pointer back to DT. When an unintentional "copy" is made, the object's address changes but the address held by the attribute does not. It still points to the original DT, whether or not that still exists, because it won't (and can't) be modified. The copy is therefore detected internally by comparing the address of the current object with the address held by the external pointer. If they don't match, R has made a "copy" (which would have lost the over-allocation that data.table carefully created). That is:

DT <- data.table(x=1:2) # internal selfref set
DT.list <- list(DT=DT)  # copy made (in R < 3.1.0), address(DT.list$DT) != address(DT)
                        # and truelength would be affected.

DT.new <- DT.list$DT    # address of DT.new != address of DT
                        # and it's not equal to the address pointed to by
                        # the attribute's 'prot' external pointer

# so a re-over-allocation has to be made by data.table at the next update by
# reference, and it warns so you can fix the root cause by not using list(),
# key<-, names<- etc.
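
For comparison, here is a sketch of the set* route that the warning message recommends. It assumes only the exported functions setnames(), setattr() and address(); the column and attribute names are invented for the example.

```r
library(data.table)

DT <- data.table(x = 1:2)
before <- address(DT)

setnames(DT, "x", "id")         # instead of names(DT) <- "id"
setattr(DT, "note", "tagged")   # instead of attr(DT, "note") <- "tagged"
DT[, y := 3:4]                  # fills a spare over-allocated slot

unchanged <- identical(before, address(DT))
print(unchanged)
print(names(DT))
```

Since none of these calls makes R copy the table, the selfref stays valid and := has no reason to warn.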

That's a lot to take in. I think I've explained it as clearly as I can. If there are any mistakes (it took me a while to wrap my head around this) or ways to make it clearer, feel free to edit or comment with your suggestions.

Hope this clears things up.

