Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
636 views
in Technique[技术] by (71.8m points)

etl - linq2db - server side bulkcopy

I'm trying to do a "database side" bulk copy (i.e. SELECT INTO/INSERT INTO) using linq2db. However, my code is trying to bring the dataset over the wire which is not possible given the size of the DB in question.

My code looks like this:

using (var db = new MyDb()) {
    var list = db.SourceTable.
        Where(s => s.Year > 2012).
        GroupBy(s => new { s.Column1, s.Column2 }).
        Select(g => new DestinationTable {
            Property1 = 'Constant Value',
            Property2 = g.First().Column1,
            Property3 = g.First().Column2,
            Property4 = g.Count(s => s.Column3 == 'Y')
        });

    db.Execute("TRUNCATE TABLE DESTINATION_TABLE");
    db.BulkCopy(new BulkCopyOptions {
      BulkCopyType = BulkCopyType.MultipleRows
    }, list);
}

The generated SQL looks like this:

BeforeExecute
-- DBNAME SqlServer.2017

TRUNCATE TABLE DESTINATION_TABLE
 DataConnection
Query Execution Time (AfterExecute): 00:00:00.0361209. Records Affected: -1.
 DataConnection
BeforeExecute
-- DBNAME SqlServer.2017
DECLARE @take Int -- Int32
SET     @take = 1
DECLARE @take_1 Int -- Int32
SET     @take_1 = 1
DECLARE @take_2 Int -- Int32
...
SELECT
        (
                SELECT TOP (@take)
                        [p].[YEAR]
                FROM
                        [dbo].[SOURCE_TABLE] [p]
                WHERE
                        (([p_16].[YEAR] = [p].[YEAR] OR [p_16].[YEAR] IS NULL AND [p].[YEAR] IS NULL) AND ...
...)
FROM SOURCE_TABLE p_16
WHERE p_16.YEAR > 2012
GROUP BY
  ...
DataConnection

That is all that is logged as the bulkcopy fails with a timeout, i.e. SqlException "Execution Timeout Expired".

Please note that running this query as an INSERT INTO statement takes less than 1 second directly in the DB.

PS: Anyone have any recommendations as to good code based ETL tools to do large DB (+ 1 TB) ETL. Given the DB size I need things to run in the database and not bring data over the wire. I've tried pyspark, python bonobo, c# etlbox and they all move too much data around. I thought linq2db had potential, i.e. basically just act like a C# to SQL transpiler but it is also trying to move data around.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I would suggest to rewrite your query because group by can not return first element. Also Truncate is a part of the library.

var sourceQuery = 
   from s in db.SourceTable 
   where s.Year > 2012
   select new 
   {
      Source = s,
      Count = s.Sql.Ext.Count(s.Column3 == 'Y' ? 1 : null).Over()
        .PartitionBy(s.Column1, s.Column2).ToValue()
      RN = Sql.Ext.RowNumber().Over()
        .PartitionBy(s.Column1, s.Column2).OrderByDesc(s.Year).ToValue()
   };

db.DestinationTable.Truncate();

sourceQuery.Where(s => s.RN == 1)
  .Insert(db.DestinationTable, 
    e => new DestinationTable 
    {
       Property1 = 'Constant Value',
       Property2 = e.Source.Column1,
       Property3 = e.Source.Column2,
       Property4 = e.Count
    });

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...