Friday, March 30, 2012

Merge Join vs. Lookup vs. Custom Script - which is fastest?

Very often we have 50,000+ rows for which we need to pull values from a different source (e.g. CityNames from city codes in an Excel file). Currently we are using Lookup - but the question is: which of those 3 options is best performance-wise?|||Have you run your own tests?

Take your source and throw it through each of the above options and finally into a row counter. Compare the time it takes to get through the whole dataflow.|||

TheViewMaster wrote:

Very often we have 50,000+ rows for which we need to pull values from a different source (e.g. CityNames from city codes in an Excel file). Currently we are using Lookup - but the question is: which of those 3 options is best performance-wise?

Only you can answer that question. Test and measure, test and measure, test and measure.

-Jamie

|||Thanks guys for your answers - I will try out performance testing this weekend in my free time.
So far it has seemed to me that merge join is slower than lookup; however, lookup also seems to take much longer than I'd like it to, so I was wondering if creating a script transform would be a better solution... Just wanted to get an idea - based on your experience, which option do you use?|||

TheViewMaster wrote:

Thanks guys for your answers - I will try out performance testing this weekend in my free time.
So far it has seemed to me that merge join is slower than lookup; however, lookup also seems to take much longer than I'd like it to, so I was wondering if creating a script transform would be a better solution... Just wanted to get an idea - based on your experience, which option do you use?

OK. Well, I am loath to give my opinions on performance comparisons, but I'd lay a lot of money on the script transform being the slowest.

-Jamie

|||

If you do test the three methods, please post the results here. I am using a custom script for lookups (small reference lists but millions of source rows in the pipeline), but I would like to know how large reference lists perform.

|||For those posting to and reading this thread, please watch the webcast presented by Donald Farmer on performance and scale in SSIS. In it, Donald talks about benchmarking and how to set up SSIS to obtain timings for the different aspects of a package.

TechNet Webcast: SQL Server 2005 Integration Services: Performance and Scale (Level 400)
MS TechNet Event ID: 1032298087

I don't know if this link will work for anyone:

https://msevents.microsoft.com/CUI/Register.aspx?culture=en-US&EventID=1032298087&CountryCode=US&IsRedirect=false|||Where can I report a BUG about this forum? I have a 50/50 chance that when I try to create a hyperlink in my post, Firefox crashes.
(Thank god I copied and pasted the following post into Notepad before "doing the hyperlink trick")|||So here we go:
I'm running the tests on my workstation: WinXP, 2.93 GHz, 2.5 GB RAM.
The DB is accessed over the LAN.

Test 1 (Lookup):
Source: multi-flat-file source (4 .txt files) with a total of 175,513 records and 88 columns.
Lookup: a query against a table of 248,250 records, pulling 61,280 records and about 25 columns.
2 outputs - Listing Found (56,523 rows) and Error Listing Not Found (118,990 rows).
The lookup runs in full cache mode and gives a warning: found duplicate key values.
Result:
Finished, 4:11:00 PM, Elapsed time: 00:00:15.437
Note: memory usage of the PC peaked at 1.8 GB, with CPU usage jumping to 100% once.

Test 2 (Merge Join):
1st source: multi-flat-file source (4 .txt files) with a total of 175,513 records and 88 columns.
2nd source: OLE DB source with a query against the table of 248,250 records, pulling 61,280 records and about 25 columns, with ORDER BY ID. The output is marked as sorted by the ID column.
The 1st source is sorted using a Sort transform.
Then merge joined with the OLE DB source via a left outer join (Sort output on the left).
Then a Conditional Split based on ISNULL(oledbsource.ID).
Result:
Finished, 4:49:33 PM, Elapsed time: 00:01:14.235
Note: memory usage of the PC peaked at 2.6 GB, with CPU usage jumping to 100% twice.

Test 3 (Script Transform):
Source: multi-flat-file source (4 .txt files) with a total of 175,513 records and 88 columns.
A script transform does a lookup based on the key column for each row in the pipeline.
Result:
Cancelled after 30 minutes of processing, during which it had processed 11,547 records (out of 175,513).
Note: memory usage was stable at around 1 GB, with CPU usage near 5%.

My Conclusion:
Although I was concerned about the performance of the lookup transform - used for testing whether data is to be inserted or updated - it seems that's not the culprit; the root of evil seems to be the OLE DB Command used for updates and the OLE DB Destination (at the moment we are using a SQL 2000 database - upgrading to 2005 soon).
Although the script transform consumed the least amount of machine resources, executing 100K+ SQL queries against the database will take too long.
Although the merge join elapsed time is not bad, the resource usage and the 3 extra steps compared to the lookup are negatives.
So I think next weekend's performance testing will be about how to make faster INSERTs/UPDATEs to the DB.

Tests 1 & 2 are based on Jamie Thomson's article - http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx
Test 3 is based on Greg Van Mullem's article - http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm|||

TheViewMaster wrote:

So here we go:
I'm running the tests on my workstation: WinXP, 2.93 GHz, 2.5 GB RAM.
The DB is accessed over the LAN.

Test 1 (Lookup):
Source: multi-flat-file source (4 .txt files) with a total of 175,513 records and 88 columns.
Lookup: a query against a table of 248,250 records, pulling 61,280 records and about 25 columns.
2 outputs - Listing Found (56,523 rows) and Error Listing Not Found (118,990 rows).
The lookup runs in full cache mode and gives a warning: found duplicate key values.
Result:
Finished, 4:11:00 PM, Elapsed time: 00:00:15.437
Note: memory usage of the PC peaked at 1.8 GB, with CPU usage jumping to 100% once.

Test 2 (Merge Join):
1st source: multi-flat-file source (4 .txt files) with a total of 175,513 records and 88 columns.
2nd source: OLE DB source with a query against the table of 248,250 records, pulling 61,280 records and about 25 columns, with ORDER BY ID. The output is marked as sorted by the ID column.
The 1st source is sorted using a Sort transform.
Then merge joined with the OLE DB source via a left outer join (Sort output on the left).
Then a Conditional Split based on ISNULL(oledbsource.ID).
Result:
Finished, 4:49:33 PM, Elapsed time: 00:01:14.235
Note: memory usage of the PC peaked at 2.6 GB, with CPU usage jumping to 100% twice.

Test 3 (Script Transform):
Source: multi-flat-file source (4 .txt files) with a total of 175,513 records and 88 columns.
A script transform does a lookup based on the key column for each row in the pipeline.
Result:
Cancelled after 30 minutes of processing, during which it had processed 11,547 records (out of 175,513).
Note: memory usage was stable at around 1 GB, with CPU usage near 5%.

My Conclusion:
Although I was concerned about the performance of the lookup transform - used for testing whether data is to be inserted or updated - it seems that's not the culprit; the root of evil seems to be the OLE DB Command used for updates and the OLE DB Destination (at the moment we are using a SQL 2000 database - upgrading to 2005 soon).
Although the script transform consumed the least amount of machine resources, executing 100K+ SQL queries against the database will take too long.
Although the merge join elapsed time is not bad, the resource usage and the 3 extra steps compared to the lookup are negatives.
So I think next weekend's performance testing will be about how to make faster INSERTs/UPDATEs to the DB.

Tests 1 & 2 are based on Jamie Thomson's article - http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx
Test 3 is based on Greg Van Mullem's article - http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm

Excellent stuff. This is really valuable information. Thank you. I've updated my post with a link to here.

|||

Yes, thanks for posting very interesting info. Today I am going to change all my script lookups to use the StringBuilder class and its methods (strongly recommended throughout the .NET literature wherever performance matters when modifying strings). Currently all my lookup script transforms use object-based .NET String variables, which are notoriously poor performers when the string values are repeatedly modified. Do you know which approach your script transform used? (assuming you are creating and modifying string variables in your lookup script)...
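A minimal sketch of the difference being described here, using illustrative values only: repeated concatenation on a String allocates a brand-new string on every modification, while StringBuilder appends into an internal buffer.

Imports System.Text

Module StringBuilderSketch
    Sub Main()
        ' String concatenation: each &= creates a new string object,
        ' which becomes expensive when repeated once per pipeline row.
        Dim slowKey As String = ""
        For i As Integer = 0 To 999
            slowKey &= i.ToString() & "|"
        Next

        ' StringBuilder: appends into a reusable buffer, so repeated
        ' modification stays cheap.
        Dim sb As New StringBuilder()
        For i As Integer = 0 To 999
            sb.Append(i).Append("|"c)
        Next
        Dim fastKey As String = sb.ToString()

        ' Same result either way; only the cost differs.
        Console.WriteLine(slowKey = fastKey)
    End Sub
End Module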

If I detect the same low processor usage in my script lookups, I may also try to partition the pipeline so the lookup runs on multiple threads...

Ken

|||My script does a lookup similar to the one described in the aforementioned Van Mullem article:

' Note: sqlConn, sqlCmd and sqlParam are assumed to be class-level fields,
' with the connection opened elsewhere (e.g. in AcquireConnections), and
' Imports System.Data.SqlClient at the top of the script.

Public Overrides Sub PreExecute()
    sqlCmd = New SqlCommand( _
        "SELECT KeyCustomer, CustomerName FROM tblCustomer WHERE (KeyCustomer = @KeyCustomer)", _
        sqlConn)
    sqlParam = New SqlParameter("@KeyCustomer", SqlDbType.Int)
    sqlCmd.Parameters.Add(sqlParam)
End Sub

Public Overrides Sub CustomerRecordsInput_ProcessInputRow(ByVal Row As CustomerRecordsInputBuffer)
    ' One database round trip per pipeline row.
    Dim reader As SqlDataReader
    sqlCmd.Parameters("@KeyCustomer").Value = Row.CUNO
    reader = sqlCmd.ExecuteReader()
    If reader.Read() Then
        Row.DirectRowToUpdateRecordsOutput()
    Else
        Row.DirectRowToInsertRecordsOutput()
    End If
    reader.Close()
End Sub

|||Ken - is your script performing a lookup against another source in the pipeline?
<boy, I'd like to know how to do that>

Also - any suggestions on how to improve the performance of the OLE DB Command used for updates?|||

Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.

A non-cached lookup will be extremely slow, as was your script component.

|||

Crispin wrote:

Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.

A non-cached lookup will be extremely slow, as was your script component.

It would be best to try to replicate full caching in the script component. The purpose of the exercise was to see which was faster. So, we know how fast (and legitimately so) the lookup component was; now, how fast can we get the script component to process?

The question is how fast can each of the elements process their data, not how slow can we make them work.

Phil
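To illustrate the full-caching idea above, here is a minimal sketch of a script-component lookup that loads the reference table once in PreExecute into a Dictionary and then probes it in memory for each row, instead of issuing one query per row. It reuses the table, column, and buffer/output names from the earlier snippet; the class-level fields and an already-open connection are assumptions to be adapted to the actual package.

Imports System.Collections.Generic
Imports System.Data.SqlClient

' Assumed class-level fields; the connection is presumed to be opened
' elsewhere (e.g. in AcquireConnections), as in the earlier example.
Private sqlConn As SqlConnection
Private customerCache As Dictionary(Of Integer, String)

Public Overrides Sub PreExecute()
    ' Load the whole reference table once, mimicking the Lookup
    ' transform's full-cache behaviour. Duplicate keys simply overwrite
    ' the earlier value rather than raising an error.
    customerCache = New Dictionary(Of Integer, String)()
    Dim cmd As New SqlCommand("SELECT KeyCustomer, CustomerName FROM tblCustomer", sqlConn)
    Using reader As SqlDataReader = cmd.ExecuteReader()
        While reader.Read()
            customerCache(reader.GetInt32(0)) = reader.GetString(1)
        End While
    End Using
End Sub

Public Overrides Sub CustomerRecordsInput_ProcessInputRow(ByVal Row As CustomerRecordsInputBuffer)
    ' In-memory probe instead of a database round trip per row.
    If customerCache.ContainsKey(Row.CUNO) Then
        Row.DirectRowToUpdateRecordsOutput()
    Else
        Row.DirectRowToInsertRecordsOutput()
    End If
End Sub

The trade-off is the same as the Lookup transform's full-cache mode: memory proportional to the size of the reference table in exchange for avoiding one round trip per source row.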
