This document is meant to supplement the perlguts(1) manual page that comes with Perl. It contains commented illustrations of all major internal Perl data structures. Having this document handy hopefully makes reading the Perl source code easier. It might also help you interpret the Devel::Peek dumps. Most of the internal Perl structures have been refactored twice, in 5.10 and 5.14. The comparison links and illustrations for 5.8 - 5.20 are now included in this single document, but are also available as extra files. 5.10 to 5.12 changes: only OOK.
The first things to look at are the data structures that represent Perl data: scalars of various kinds, arrays and hashes. Internally Perl calls a scalar SV (scalar value), an array AV (array value) and a hash HV (hash value). In addition it uses IV for integer value, NV for numeric value (aka double), PV for pointer value (aka string value (char*), but 'S' was already taken), and RV for reference value. The IVs are further guaranteed to be big enough to hold a

The internal relationship between the Perl data types is really object oriented. Perl relies on C's structural equivalence to help emulate something like C++ inheritance of types. The various data types that Perl implements are illustrated in this class hierarchy diagram. The arrows indicate inheritance (IS-A relationships).

As you can see, Perl uses multiple inheritance with SvNULL (also named just SV) acting as some kind of virtual base class. All the Perl types are identified by small numbers, and the internal Perl code often gets away with testing the IS-A relationship between types with the <= operator. As you can see from the figure above, this can only work reliably for some comparisons. All Perl data value objects are tagged with their type, so you can always ask an object what its type is and act according to this information.

As of 5.14, the symbolic SvTYPE names (and associated values) are:
In addition to the simple type names already mentioned, the following names are found in the hierarchy figure: A PVIV value can hold a string and an integer value. A PVNV value can hold a string, an integer and a double value. The PVMG is used when magic is attached or the value is blessed. The PVLV represents an LValue object. RV is now a separate scalar of type SVt_IV. CV is a code value, which represents a Perl function/subroutine/closure or contains a pointer to an XSUB. GV is a glob value, and IO contains pointers to open files and directories and various state information about these. The PVFM is used to hold information on forms. P5RX was formerly called PVBM for Boyer-Moore (match information), but now contains regex information. BIND was an unused placeholder for read-only aliases or VIEW (#29544, #29642). INVLIST is an inversion list object used only inside CORE, for faster utf8 matching, since 5.19.2. It has the same layout as a PV.

A Perl data object can change type as the value is modified. The SV is said to be upgraded in this case. Type changes only go down the hierarchy. (See the sv_upgrade() function in sv.c.)

The actual layout in memory does not really match how a typical C++ compiler would implement a hierarchy like the one depicted above. Let's see how it is done.

In the description below we use field names that match the macros that are used to access the corresponding field. For instance the

_SV_HEAD and struct sv

The simplest type is the "struct sv". It represents the common structure for an SV, GV, CV, AV, HV, IO and P5RX, without any struct xpv<xx> attached to it. It consists of four words: the _SV_HEAD with 3 values and the SV_U union with one pointer.

The first word contains the ANY pointer to the optional body. All types except the RV are implemented by attaching additional data to the ANY pointer.

The second word is a 32-bit unsigned integer reference counter (REFCNT), which should tell us how many pointers reference this object.
When Perl data types are created this value is initialized to 1. The field must be incremented when a new pointer is made to point to it and decremented when the pointer is destroyed or assigned a different value. When the reference count reaches zero the object is freed.

The third word contains a FLAGS field and a TYPE field, together as a 32-bit unsigned integer.

Since 5.10 the fourth and last HEAD word contains the sv_u union, which contains a pointer to another SV (an RV), the IV value, the PV string, the AV svu_array, an HE hash or a GP struct.

The TYPE field contains a small number (0-127, mask

The purposes of the SvFLAGS bits are:
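
The reference counting rules described above (start at 1, increment for each new pointer, decrement when a pointer goes away, free at zero) can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not Perl's actual code: toy_sv, toy_new, toy_inc, toy_dec and toy_live are hypothetical names, and the struct merely mimics the four-word head layout. In the Perl source the corresponding operations are done through macros such as SvREFCNT_inc and SvREFCNT_dec.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the four-word SV head.
 * It is NOT Perl's real struct sv. */
typedef struct toy_sv {
    void    *any;     /* pointer to the optional body (unused here) */
    uint32_t refcnt;  /* how many pointers reference this object */
    uint32_t flags;   /* FLAGS and TYPE share one 32-bit word */
    union {           /* fourth head word */
        long           iv;
        char          *pv;
        struct toy_sv *rv;
    } u;
} toy_sv;

/* Number of live toy objects; exists only to make freeing observable. */
int toy_live = 0;

/* A new object starts with a reference count of 1. */
toy_sv *toy_new(void) {
    toy_sv *sv = calloc(1, sizeof *sv);
    if (!sv)
        abort();
    sv->refcnt = 1;
    toy_live++;
    return sv;
}

/* Call when a new pointer is made to point at sv. */
toy_sv *toy_inc(toy_sv *sv) {
    sv->refcnt++;
    return sv;
}

/* Call when a pointer to sv is destroyed or reassigned.
 * The object is freed once the count reaches zero. */
void toy_dec(toy_sv *sv) {
    assert(sv->refcnt > 0);
    if (--sv->refcnt == 0) {
        toy_live--;
        free(sv);
    }
}
```

A caller that stores a second pointer to the same object would call toy_inc() once and later balance it with one toy_dec(); the object survives until the last holder has released it.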
The Arena

Since 5.10 SV heads and bodies are allocated in 4K arena chunks. Heads need 4 fields; bodies are kept in unequally sized arena sets. Some types need no body (NULL, IV, RV), and some allocate only partial bodies with "ghost" fields.

PL_sv_arenaroot points to the first reserved SV arena head with some private arena data, a link to the next arena, some flags and the number of free slots. PL_body_arenas is the head of the unevenly sized linked list of body arenas.

SvPV

A scalar that can hold a string value is called an SvPV. In addition to the SV struct of SvNULL, an xpv struct ("body") is allocated, and it contains 3-4 fields. svu_pv was formerly called PVX, and before 5.10 it was the first field of xpv. svu_pv/PVX is the pointer to an allocated char array. The old field names must be accessed through the old macros; for PVX that macro is SvPVX(). CUR is an integer giving the current length of the string. LEN is an integer giving the length of the allocated string. The byte at (PVX + CUR) should always be '\0' so that the string is NUL-terminated if passed to C library routines. This requires that LEN is always at least 1 larger than CUR.

The POK flag indicates that the string pointed to by PVX contains a valid string value. If the POK flag is off and the ROK flag is turned on, then the PVX field is used as a pointer to an RV (see SvRV below) and the struct xpv is unused. An SvPV with both the POK and ROK flags turned off represents undef. The PVX pointer can also be NULL when POK is off and no string storage has been allocated.

If the string is shared, created by sharepvn, the PVX is part of a HEK, i.e. the PVX points to the hek_key of the struct hek.

Since 5.18 there is a separate IsCOW flag indicating that the PVX is shared as long as nobody changes the value. The current implementation adds a COW_REFCNT byte at the aligned end of the PVX, which makes it unusable for COW in the static compiler and threads.
It also requires that LEN is always at least 2 larger than CUR to keep the \0 byte. But beware: shared COWs use SvLEN=0 and set hek_len.

SvPVIV and SvPVNV

The SvPVIV type is like SvPV but has an additional field to hold a single integer value, called IVX, in xiv_u. The IOK flag indicates whether the IVX value is valid. If both the IOK and POK flags are on, then the PVX will (usually) be a string representation of the same number found in IVX.

The SvPVNV type is like SvPVIV but uses the single double value called NVX in xnv_u. The corresponding flag is called NOK.

SvOOK

As a special hack, in order to improve the speed of removing characters from the beginning of a string, the OOK flag is used. SvOOK_offset used to be stored in SvIVX, but since 5.12 it is stored within the first 8 bits (one char) of the buffer. PVX, CUR and LEN are adjusted to point within the allocated string instead.

SvIV

Since 5.10, for a raw IV (without PV) the IVX slot is in the HEAD; no xpviv struct ("body") is allocated. The SvIVX macro abuses SvANY pointer arithmetic to point to a compile-time calculated negative offset from HEAD-1 to sv_u.svu_iv, so that PVIV and IV can use the same SvIVX macro.

SvNV

Since 5.10, for a raw NV (without PV) the xpvnv struct is not fully allocated, only the needed body size.

SvRV

The SvRV type uses the fourth HEAD word sv_u.svu_rv as a pointer to an SV (which can be any of the SvNULL subtypes), AV or HV.

SvPVMG

SvPVMG is used for blessed scalars or scalars with other magic attached. It has two additional fields: MAGIC and STASH. MAGIC is a pointer to additional structures that contain callback functions and other data. If the MAGIC pointer is non-NULL, then one or more of the MAGICAL flags will be set.

STASH (symbol table hash) is a pointer to an HV that represents some namespace/class/package. (That the HV represents a namespace means that the NAME field of the HV must be non-NULL. See description of HVs and stashes below.)
The STASH field is set when the value is blessed into a package (becomes an object). The OBJECT flag will be set when STASH is. (IMHO, this field should really have been named "CLASS". The GV and CV subclasses introduce their own unrelated fields called STASH which might be confusing.) The field MAGIC points to an instance of
The SvPVBM (old)

Since 5.10 SvPVBMs are really PVGVs with the VALID flag set and "B" magic attached. Before that, SvPVBMs were SV objects of their own.
The SvPVBM is like SvPVMG above. It uses the
A table of 256 elements is appended to the PVX. This table contains the distance from the end of string of the last occurrence of each character in the original string. (In recent Perls, the table is not built for strings shorter than 3 characters.) In addition fbm_compile() locates the rarest character in the string (using builtin letter frequency tables) and stores this character in the BmRARE field. The BmPREVIOUS field is set to the location of the first occurrence of the rare character. BmUSEFUL is incremented (decremented) by the RE engine when this constant substring helps (does not help) in optimizing RE engine access away. If it goes below 0, then the corresponding substring is forgotten and freed.

The extra SvPVBM information and the character distance table are only valid when the VALID flag is on. A magic structure with the sole purpose of turning off the VALID flag on assignment is always attached to a valid SvPVBM.

The TAIL flag is used to indicate that the search for the SvPVMG should be tail anchored, i.e. a match should only be considered at the end of the string (or before newline at the end of the string).

REGEXP (P5RX)

The structures behind the P5RX, the struct regexp, store the compiled and optimized state of a Perl regular expression. New here is support for pluggable regex engines (the original engine was criticized: "Thompson NFA for abnormal expressions would be linear, but does not support backtracking"), non-recursive execution, and faster trie structures for alternations. See re::engine::RE2 for the fast DFA implementation without backrefs.

The struct regexp contains the compiled bytecode of the expression and some meta-information about the regex, such as the engine used, the precomp and the number of pairs of backreference parentheses. reg_data contains code and pad pointers for EXEC items in the bytecode.
Since 5.11 the REGEXP is separate from a PVMG, blessed into the "Regexp" package, with the SvANY pointing to the struct regexp, and SvPVX pointing to the string representation of the qr//.
Nobody has managed a successful freeze/thaw of those internal structures so far, but we have Abhijit's
    PM_SETRE(&pm, CALLREGCOMP(newSVpv($restring), $op->pmflags));
    RX_EXTFLAGS(PM_GETRE(&pm)) = $op->reflags;
See perlreguts for some details.

SvPVLV

The SvPVLV is like SvPVMG above, but has four additional fields: TARGOFF, TARGLEN, TARG, TYPE. The typical use is for Perl builtins that can be used in LValue context (substr, vec, ...). They return an SvPVLV value which, when assigned to, uses magic to affect the target object, to which it keeps a pointer in the TARG field. The xiv_u union is used as the GvNAME field, pointing to a namehek.

The TYPE is a character variable. It encodes the kind of LValue this is. Interpretation of the other LValue fields depends on the TYPE. The SvPVLVs are (almost) always magical. The magic type will match the TYPE field of the SvPVLV. The types are:
The figure below shows an SvPVLV as returned from the

When assignment to an SvPVLV type occurs, the value to be assigned is first copied into the SvPVLV itself (and affects the PVX, IVX or NVX). After this the magic SET method is invoked, which will update the TARG accordingly.

AV

An array is in many ways represented similarly to strings. An AV contains all the fields of SvPVMG, but no more. Some fields of xpvav and sv have been renamed. ARYLEN uses the MAGIC field to point to a magic SV (which is returned when

The previous extra FLAGS field in the xpvav has been merged into the sv_flags field.

The array pointed to by ARRAY contains pointers to any of the SvNULL subtypes. Usually ALLOC and ARRAY both point to the start of the allocated array. The use of two pointers is similar to the OOK hack described above. The shift operation can be implemented efficiently by just adjusting the ARRAY pointer (and FILL/MAX). Similarly, the pop just involves decrementing the FILL count.

There are only 2 array flags defined:
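
The ALLOC/ARRAY trick described in the AV section above can be illustrated with a small C sketch. It is a simplified model under stated assumptions, not the code in av.c: toy_av and the toy_av_* functions are hypothetical names, elements are plain void pointers, and no reference counting or reallocation is modelled. It only shows that shift moves the ARRAY pointer while ALLOC stays put, and that pop only decrements FILL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the AV fields discussed above. */
typedef struct toy_av {
    void    **alloc;  /* start of the allocated block (ALLOC) */
    void    **array;  /* first live element (ARRAY) */
    ptrdiff_t fill;   /* index of the last live element, -1 if empty (FILL) */
    ptrdiff_t max;    /* highest usable index relative to array (MAX) */
} toy_av;

/* Set up an array over caller-provided storage holding `count` elements. */
void toy_av_init(toy_av *av, void **storage, ptrdiff_t capacity,
                 ptrdiff_t count) {
    assert(count <= capacity);
    av->alloc = storage;
    av->array = storage;
    av->fill  = count - 1;
    av->max   = capacity - 1;
}

/* shift: return the first element and advance ARRAY past it.
 * Nothing is copied; ALLOC still remembers the real start of the block. */
void *toy_av_shift(toy_av *av) {
    if (av->fill < 0)
        return NULL;
    void *first = av->array[0];
    av->array++;
    av->fill--;
    av->max--;
    return first;
}

/* pop: return the last element and decrement FILL. */
void *toy_av_pop(toy_av *av) {
    if (av->fill < 0)
        return NULL;
    return av->array[av->fill--];
}
```

After a shift, the difference between array and alloc tells how many slots at the front are unused, which is the same idea as the string offset kept for OOK.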
HV

Hashes are the most complex of the Perl data types. In addition to what we have seen above, the very last index in the HE*[] points to a new xpvhv_aux struct. HVs use HE structs to represent "hash element" key/value pairs and HEK structs to represent "hash element keys".
The first few fields of the xpvhv have been renamed in the same way as for AVs. MAX is the number of elements in ARRAY minus one. (The size of the ARRAY is required to be a power of 2, since the code that deals with hashes just masks off the last few bits of the HASH value to locate the correct HE column for a key:

The HEs are simple structs containing 3 pointers: a pointer to the next HE, a pointer to the key and a pointer to the value of the given hash element.

The HEKs are special variable-sized structures that store the hash keys. They contain 4 fields: the computed hash value of the string, the length of the string, len+1 bytes for the key string itself (including trailing NUL), and a trailing byte for HEK_FLAGS (since 5.8). As a special case, a len value of

In a perfect hash both KEYS and FILL are the same value. This means that all HEs can be located directly from the pointer in the ARRAY (and all the he->next pointers are NULL).

The following two hash specific flags are found among the common SvNULL flags:
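
The bucket selection and chain walk described in the HV section above can be sketched in C roughly as follows. This is a simplified model, not the code in hv.c: toy_hek, toy_he, toy_bucket and toy_find are hypothetical names, the key is referenced by pointer instead of being stored inline in the HEK, and the hash values used with it are arbitrary numbers rather than the output of Perl's hash function. The only point is that, because the array size is a power of 2, MAX doubles as a bit mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified hash element key: hash, length and key bytes. */
typedef struct toy_hek {
    uint32_t    hash;
    int         len;
    const char *key;
} toy_hek;

/* Hypothetical hash element: three pointers, as described in the text. */
typedef struct toy_he {
    struct toy_he *next;  /* next element in the same column */
    toy_hek       *hek;   /* the key */
    void          *val;   /* the value */
} toy_he;

/* max is the array size minus one; with a power-of-2 size it is a mask
 * that keeps only the low bits of the hash. */
size_t toy_bucket(uint32_t hash, size_t max) {
    return (size_t)hash & max;
}

/* Walk the chain of the selected column until the key matches. */
toy_he *toy_find(toy_he **array, size_t max, uint32_t hash,
                 const char *key, int len) {
    toy_he *he = array[toy_bucket(hash, max)];
    for (; he != NULL; he = he->next) {
        if (he->hek->hash == hash && he->hek->len == len &&
            memcmp(he->hek->key, key, (size_t)len) == 0)
            return he;
    }
    return NULL;
}
```

In a perfect hash every column holds at most one element, so the loop body would run at most once per lookup.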
GV

GV ("glob value" aka "symbol") shares the same structure as the SvPVMG.

The GP is a pointer to a structure that holds pointers to data of various kinds. Perl uses a pointer, instead of including the GP fields in the xpvgv, in order to implement the proper glob aliasing behavior (i.e. different GVs can share the same GP).

The NAMEHEK denotes the unqualified name of this symbol and GvSTASH points to the symbol table where this symbol belongs. The fully qualified symbol name is obtained by taking the NAME of the GvSTASH (see HV above) and appending "::" and NAME to it. The hash pointed to by GvSTASH will usually contain an element with NAME as key and a pointer to this GV as value. See description of stashes below.

A magic of type '*' is always attached to the GV (not shown in the figure). The magic GET method is used to stringify the globs (as the fully qualified name prefixed with '*'). The magic SET method is used to alias a glob based on the name of another glob.
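
The naming rule described for GVs above (stash NAME, then "::", then the symbol NAME, with a leading '*' when the glob is stringified) can be shown with a tiny C sketch. The function names toy_gv_fullname and toy_gv_stringify are hypothetical and the names are passed in as plain C strings; in Perl itself they would be read from the stash and the NAMEHEK.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "STASHNAME::NAME" into buf. Returns what snprintf returns:
 * the length the full result needs, excluding the trailing NUL. */
int toy_gv_fullname(char *buf, size_t size,
                    const char *stash_name, const char *name) {
    return snprintf(buf, size, "%s::%s", stash_name, name);
}

/* Build the stringified glob: the fully qualified name prefixed with '*'. */
int toy_gv_stringify(char *buf, size_t size,
                     const char *stash_name, const char *name) {
    return snprintf(buf, size, "*%s::%s", stash_name, name);
}
```

For a symbol named foo in the stash named main, the first function yields main::foo and the second yields *main::foo.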
GP

GPs can be shared between one or more GVs. The data type fields for the GP are: SV, IO, FORM, AV, HV, CV. These hold a pointer to the corresponding data type object. (The SV must point to some simple SvNULL subtype (i.e. with type <= SVt_PVLV). The FORM field must point to an SvPVFM if non-NULL. The IO field must point to an IO if non-NULL, the AV to an AV, etc.) The SV is always present (but might point to an SvNULL object). All the others are initially NULL.

The additional administrative fields in the GP are: CVGEN, REFCNT, EGV, LINE, FILE_HEK.

REFCNT is a reference counter. It says how many GVs have a pointer to this GP. It is incremented/decremented as new GVs reference/forget this GP. When the counter reaches 0 the GP is freed.

EGV, the "effective gv", if *glob, is a pointer to the GV that originally created this GP (used to tell the real name of any aliased symbol). If the original GV is freed, but the GP should stay since another GV references it, then the EGV is NULLed.

CVGEN is an integer used to validate method cache CV entries in the GP. If CVGEN is zero, then the CV is real. If CVGEN is non-zero, but less than the global variable subgeneration, then the CV contains a stale method cache entry. If CVGEN is equal to subgeneration then the CV contains a valid method cache entry.

FILE_HEK is the name of the file where this symbol was first created. LINE is the corresponding line number in the file.

Stashes

GVs and stashes work together to implement the name spaces of Perl. Stashes are named HVs with all the element values being pointers to GVs. The root of the namespace is pointed to by the global variable

In the figure below we have simplified the representation of stashes to a single box. The text in the blue field is the NAME of the HV/stash. The hash element keys are shown as field names and the element values are shown as pointers to globs (GV). The GVs are also simplified to a single box. The text in the green field is the fully qualified name of the GV.
Only the GP data fields are shown (and FORM has been eliminated because it was not 2 letters long :-). The figure illustrates how the scalar variables

All resolution of qualified names starts with the stash pointed to by the

As you can see from this figure, there are lots of pointers to dereference in order to look up deeply nested names. Each stash is at least 4 levels deep and each glob is 3 levels, giving at least 24 pointer dereferences to access the data in the

The CV

The CV ("code value") is like SvPVMG above, but has some renamed and additional fields: CvSTASH, START, ROOT, GV, FILE, DEPTH, PADLIST, OUTSIDE, OUTSIDE_SEQ, CvFLAGS. The
DEPTH and PADLIST are needed to access and check the current scratchpad. Lexicals are accessed by the OP->targ index into the PADLIST.

SvPVFM

The SvPVFM is like CV above, but adds a single field called LINES.

IO

The IO is like SvPVMG above, but has quite a few additional fields. IoFLAGS
PADA |