For this to work, you must have already the table dir_struct
(from your previous question) so that the file names can be inserted to the table with the id of the directory they belong to.
First I create the new table files
:
CREATE TABLE files(
FileDataID INTEGER REFERENCES listfile(FileDataID),
Directory_Parent INTEGER REFERENCES dir_struct(Directory),
Value
);
You must also create a unique index for FileDataID
in listfile
, because it is not defined as the PRIMARY KEY
or UNIQUE
, so columns in other tables (like the column FileDataID
of files
) can reference it.
CREATE UNIQUE INDEX idx_listfile_FileDataID ON listfile(FileDataID);
A recursive CTE
is used to query dir_struct
and build all the possible paths and it is joined to listfile
to match the file names and their paths:
WITH cte AS (
SELECT Directory, Directory_Parent, Value, '' full_path
FROM dir_struct
WHERE Directory = 0
UNION ALL
SELECT d.Directory, d.Directory_Parent, d.Value, full_path || d.Value || '/'
FROM dir_struct d INNER JOIN cte c
ON c.Directory = d.Directory_Parent
)
INSERT INTO files(FileDataID, Directory_Parent, Value)
SELECT f.FileDataID, c.Directory, SUBSTR(f.Path, LENGTH(c.full_path) + 1)
FROM listfile f INNER JOIN cte c
ON f.Path LIKE c.full_path || '%' AND INSTR(SUBSTR(f.Path, LENGTH(c.full_path) + 1), '/') = 0
See the demo, where the code for the insertions in dir_struct
has also been modified because now the table listfile
contains files at the root, which did not exist in the sample data of your previous question.
So the code in the demo must be executed as a whole.
I used your 1MB sample data and the queries ran very fast.
But, for 1M rows (from the link you first posted), which I also tested (and found duplicates which you must delete before doing anything else), the creation of the table files
took about 1.5 hour.
As I mentioned in my answer of your previous question, if this is a one time thing then use it. If you will need it frequently, then you should consider something else.