
Friday, March 30, 2012

Merge Join's poor performance

Hello All,

I'm experiencing performance problems with the Merge Join component.
Every time I build a nice package using it, I end up deleting it and using a SQL statement in the OLE DB Source to accomplish the join, since it takes forever to run and crushes my computer in the process.
It makes me feel like I'm not using the abilities SSIS has to offer compared to DTS.
Of course, with several thousand records it works fine, but in a production environment with hundreds of thousands of rows, it seems futile.

Has anyone had a little more luck with it?

Liran R wrote:

I'm experiencing performance problems with the Merge Join component. [...]

If I were you I would use the OLE DB Source component to do the join - there is absolutely nothing wrong with doing that. If you have a super-performant relational database engine at your disposal, why not use it?
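
For illustration, here is a minimal sketch of that approach (table and column names are invented for the example): a single OLE DB Source whose query performs the join, so the database engine does the work that the Merge Join would otherwise do.

    -- Hypothetical source query for a single OLE DB Source.
    -- The database engine performs the join; SSIS receives one
    -- already-joined row set instead of two inputs to a Merge Join.
    SELECT  o.OrderID,
            o.OrderDate,
            c.CustomerName,
            c.Region
    FROM    dbo.Orders AS o
    INNER JOIN dbo.Customers AS c
            ON c.CustomerID = o.CustomerID;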

Donald Farmer talks around this a little in his OVAL webcast. If you only watch one SSIS webcast in your life then it should be this one.

Donald Farmer's Technet webcast
(http://blogs.conchango.com/jamiethomson/archive/2006/06/14/4076.aspx)

-Jamie

|||Thanks, I'll take a look. Does the merge join take place in the cache?|||

Can you elaborate on what you mean by "the cache"?

-Jamie

|||

Sure. When I use a Lookup component to perform the join, the lookup table is cached by default. When I want to join to a large table (with several million rows), I understood it's best practice to use the Merge Join component, but as far as I can see this component also caches the records in the machine's memory, so I don't see the benefit...

|||

There's a different sort of memory usage going on here.

Yes, the LOOKUP has an area of memory that we call the cache, which is used for storing the lookup set.

MERGE JOIN stores data in memory (as do all asynchronous components), but it's a different kind of memory and we don't refer to it as a cache. It is just the component's working area - more commonly termed a buffer. Also, this working area changes as MERGE JOIN does its work, whereas the LOOKUP cache is static.

It is also worth saying that MERGE JOIN can spool data to disk if it is running out of memory; LOOKUP cannot do that with its cache.

-Jamie

|||

Hi SSIS friend,

I remember asking a similar question a couple of months back. I think I had 15+ Merge Join components in a test package and the performance was awful. Jamie's advice then and now is pretty sound.

During my short period experimenting with SSIS, I came to realise that in order to create efficient packages, I had to utilise the power of both the SQL Server and SSIS engines. Each one performs certain tasks better than the other. It takes time and I'm still learning, but the more you play around with it, the easier it gets to choose which engine to use.
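
As a hedged illustration of that division of labour (all object names below are invented): derivations and aggregations can often be pushed into the source query, so that Derived Column, Aggregate, and Sort transformations are not needed at all.

    -- Hypothetical source query: derive, aggregate, and sort in SQL Server.
    SELECT  CustomerID,
            YEAR(OrderDate) AS OrderYear,           -- derived column
            SUM(Quantity * UnitPrice) AS Revenue    -- aggregation
    FROM    dbo.OrderLines
    GROUP BY CustomerID, YEAR(OrderDate)
    ORDER BY CustomerID, OrderYear;                 -- pre-sorted for a Merge Join

If the output feeds a Merge Join, remember to set IsSorted to True and SortKeyPosition on the source's output columns in the Advanced Editor, since SSIS cannot detect the ORDER BY on its own.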

|||

Thanks guys.


Merge Join: number of output rows unchanged when the amount of input changes

Dear all,

I created a package that seems to work fine with a small amount of data. However, when I run the package with more data (as in production), the merge join output is limited to 9963 rows, no matter how I change the number of input rows.

Situation as follows.

The package has 2 OLE DB Sources, in which SQL statements have been defined to retrieve the data.

The flow of source 1 is: retrieving source data -> trimming (non-key) columns -> sorting on the key columns.

The flow of source 2 is: retrieving source data -> deriving 2 new columns -> aggregating the data to the level of source 1 -> sorting on the key columns.

Then both flows are merged and other steps are performed.

If I test with just a couple of rows, it works fine. But when I change the WHERE clause in the data source retrieval so that the number of rows is, for instance, 15,000 or 150,000, the number of rows after the merge join is still 9963.

When I run the package in debug mode, the step is colored green; nevertheless, an error is displayed:

Error: 0xC0047022 at Data Flow Task, DTS.Pipeline: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "Merge Join" (4703) failed with error code 0xC0047020. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.

To be honest, a few more error messages appear, but they don't seem related to this issue. The package stops running after some 6000 rows have been written to the destination.

Any help will be greatly appreciated.

Kind regards,

Albert.

If you could post the full error output here, that would probably be helpful. Sometimes it is the "big view" that helps point you at the cause of the problem, especially since many of SSIS's error messages are not particularly transparent. The earlier errors often show what triggered the later errors, even if they do not appear directly related.

|||Can you also include details of the next task in the pipeline, the one that accepts the ~9000 rows? It sounds like it's failing on the first buffer it receives as input.|||

Ok, for the big picture: I feel like a fool.

Solving one of the other errors solved the reported issue as well.

Apologies for bothering you.

Merge Join

Hello all,

I have a package where I use a Merge Join on two sorted inputs, and the output is stored in a raw file.

In another package, the raw file from the above package is merge joined again with another sorted input. Now my question is: do we need to sort the raw file from the first package again, or is it OK to set the IsSorted property to True and define the sort keys?

Thank you.

As long as you know the input is sorted, you can use the IsSorted property.

If it turns out not to be sorted, you could miss joins.
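
For illustration (a hedged sketch; all names are invented): if the first package sorts the rows in its source query before writing the raw file, the second package can simply declare that order on the Raw File Source instead of re-sorting.

    -- Package 1 (hypothetical): sort on the join keys before the rows
    -- are written to the raw file, so the file's row order matches the
    -- keys the downstream Merge Join needs.
    SELECT  AccountID, TransactionDate, Amount
    FROM    dbo.Transactions
    ORDER BY AccountID, TransactionDate;

    -- Package 2: on the Raw File Source output (Advanced Editor), set
    --   IsSorted        = True
    --   SortKeyPosition = 1 on AccountID, 2 on TransactionDate
    -- so no Sort transformation is needed before the Merge Join.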

BobP

|||

Thank you for the reply...

So here in my case I am using a left join (in the Merge Join)... how would I know whether the raw file output from the first package is sorted or not?

Sorting it again is not a problem, but it really takes a lot of time (as the file is very large).

|||

If you are sorting it and then writing it out, it should stay sorted.

BobP

|||OK... if it stays sorted then that's fine. Many thanks.

Wednesday, March 7, 2012

Memory problems & package performance

Hi,

I know the SSIS memory problem has probably been covered quite a bit, but being a newbie to SSIS, I'm not quite sure how to improve the performance of my package.

Basically, I have a package that loops through all the subdirectories within a specified directory; it then loops through each file in each subdirectory and, using a Data Flow, processes each file (according to its filename) with a Script Component to insert data into a SQL Server database.

Each subdirectory has up to 15 different CSV files, but each is less than 5 kB. I probably have about 100 subdirectories.

When I run the package, it functions properly at first, but after a while it stalls (no error, just stuck in one Data Flow), and when I check, CPU usage is running at 100%.

I'm not sure how I can fix this or improve the memory allocation. I was not expecting any memory problems, as the files are small and the number of rows going into and out of the Script Component is no more than 20.

Any advice? Thanks.

Just an update: I observed that running my package from Visual Studio (the development environment) takes up a large chunk of memory, so I closed VS and ran the package through DTEXEC.

From the Windows Task Manager, I can see that the memory usage for DTEXEC.exe keeps increasing, so I'm guessing that everything from previous iterations is still being held in memory. Is there a way to flush the memory each time I process a new file, as I don't need to keep data that has already been processed?

I'm not keen on increasing the PC's memory as a solution, as I don't think SSIS memory usage should keep increasing. Any suggestions?