Friday, March 30, 2012

Merge or Transaction Replication?

I have a branch office I would like to connect to our database, which is
fronted by an Access ADP project. I'm having problems with security (the old
runtime 17), but in other small tests it seems the system is very slow. I
guess I'm not really that surprised.
I am considering using replication to set up a second server. Our workflow
is fairly simple. There are many tables that are effectively read-only, and
could be replicated using any method. There is one table that will see
updates from the remote site, as well as a trigger that fires when that table
is updated (keeping an audit log).
Can anyone suggest the best method for this case? It would seem that Merge
replication would be fine, but the documentation suggests I will have
problems with ACID on merge, and little else. Transactional sounds fine,
there's only about 5 tables that see updates on any sort of daily basis, and
only the one orders table that really gets hit.
However that table uses an auto-incrementing pkey. Will that even work?
Maury
|||For uni-directional data flow, transactional replication will work best.
Identity values will be assigned on the publisher/data source. If you are
replicating to a table which has the identity property on this column, you
will need to use the NOT FOR REPLICATION option. By default, transactional
replication will not put the identity property on the columns at the
subscriber/data destination.
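As a minimal sketch of what that option looks like (the table and column names here are hypothetical):

-- Hypothetical orders table. NOT FOR REPLICATION lets the replication
-- agent insert the publisher's explicit identity values, while local
-- inserts still auto-increment normally.
CREATE TABLE dbo.Orders (
    OrderID      int IDENTITY(1, 1) NOT FOR REPLICATION PRIMARY KEY,
    CustomerCode varchar(20) NOT NULL,
    OrderDate    datetime NOT NULL DEFAULT GETDATE()
);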
Merge is intended for clients which are frequently offline and for when you
need bi-directional replication. With careful partitioning you will avoid
the collisions which will break the D part of ACID.
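To illustrate the kind of partitioning meant here (a sketch only; the publication, table, and filter column are hypothetical), a merge article can be filtered so each site only receives, and therefore only updates, its own rows:

-- Dynamic filter: each subscriber gets only its own branch's rows,
-- so two sites never modify the same row and conflicts are avoided.
EXEC sp_addmergearticle
    @publication = N'BranchPub',
    @article = N'Orders',
    @source_owner = N'dbo',
    @source_object = N'Orders',
    @subset_filterclause = N'BranchCode = HOST_NAME()';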
If you need bi-directional replication you can use bi-directional
transactional replication.
--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Maury Markowitz" <MauryMarkowitz@.discussions.microsoft.com> wrote in
message news:6016EF86-8F9A-4D0D-9C83-CF08AB06107A@.microsoft.com...
> I have a branch office I would like to connect to our database, which is
> fronted by an Access ADP project. I'm having problems with security (the
old
> runtime 17), but in other small tests it seems the system is very slow. I
> guess I'm not really that surprised.
> I am considering using replication to set up a second server. Our workflow
> is fairly simple. There are many tables that are effectively read-only,
and
> could be replicated using any method. There is one table that will see
> updates from the remote site, as well as a trigger that fires when that
table
> is updated (keeping an audit log).
> Can anyone suggest the best method for this case? It would seem that Merge
> replication would be fine, but the documentation suggests I will have
> problems with ACID on merge, and little else. Transactional sounds fine,
> there's only about 5 tables that see updates on any sort of daily basis,
and
> only the one orders table that really gets hit.
> However that table uses a auto-incrementing pkey. Will that even work?
> Maury|||"Hilary Cotter" wrote:
> For uni-directional data flow transactional replication will work best.
No good here, one of the tables WILL be bi-directional.
> Identity values will be assigned on the publisher/data source.
This is a bit confusing. In my case I am sharing one updating table. If I
turn this off, does this mean that a new row will always be assigned a
"local" identity, one that might be different on the remote machine?
Maury


Merge on MSDE: how can I do it?

Hello everybody,
I need to set up an MSDE database as a 'publisher' for merge replication.
My difficulty is that I don't understand how I can do it.
I don't have Enterprise Manager to set up the merge. How can I do it?
Thank you so much,
You could create a merge publication on your normal server and get EM to
script it out. This would be in the form of a series of system stored
procedure calls. After changing the servername, you should have a script
that could be run on MSDE.
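The generated script boils down to a handful of system stored procedure calls; a minimal sketch (the database, publication, and article names are made up for illustration):

-- Enable the database for merge publishing, then create a publication
-- with a snapshot agent and a single article.
EXEC sp_replicationdboption @dbname = N'MyDB',
    @optname = N'merge publish', @value = N'true';
EXEC sp_addmergepublication @publication = N'MyMergePub',
    @description = N'Merge publication of MyDB';
EXEC sp_addpublication_snapshot @publication = N'MyMergePub';
EXEC sp_addmergearticle @publication = N'MyMergePub',
    @article = N'MyTable', @source_owner = N'dbo',
    @source_object = N'MyTable';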
Alternatively, you could create it using SQLDMO. I don't have an example of
merge, but I do have 2 examples of transactional on
www.replicationanswers.com.
Rgds,
Paul Ibison SQL Server MVP, www.replicationanswers.com
(recommended sql server 2000 replication book:
http://www.nwsu.com/0974973602p.html)
|||"Paul Ibison" <Paul.Ibison@.Pygmalion.Com> ha scritto nel messaggio news:%23eBzl8WAFHA.2016@.TK2MSFTNGP15.phx.gbl...
Thank you for your answer,
but I need a tool that does it for me.
Is there any open-source/free software that I can use?
Thank you!
ck1;
|||I had a browse of the 3rd party tools for managing MSDE,
but the 3 I looked at didn't list replication
administration. Scripting out a subscription is not
difficult, but I can appreciate the functionality you are
looking for and I'll also keep an eye on this thread to
see if anyone else has such a GUI tool.
Rgds,
Paul Ibison SQL Server MVP, www.replicationanswers.com
(recommended sql server 2000 replication book:
http://www.nwsu.com/0974973602p.html)
|||What language? I have a library (VB6 and .NET versions) to set up merge
replication programmatically that's been used in a couple of client projects
successfully.
GUI can be put on it relatively easily.
Paul
"Paul Ibison" <Paul.Ibison@.Pygmalion.Com> wrote in message
news:26e001c502c7$94d45ce0$a401280a@.phx.gbl...
> I had a browse of the 3rd party tools for managing MSDE,
> but the 3 I looked at didn't list replication
> administration. Scripting out a subscriuption is not
> difficult, but I can appreciate the functionality you are
> looking for and I'll also keep an eye on this thread to
> see if anyone else has such a GUI tool.
> Rgds,
> Paul Ibison SQL Server MVP, www.replicationanswers.com
> (recommended sql server 2000 replication book:
> http://www.nwsu.com/0974973602p.html)
>
sql

Merge Module with Verbose Logging?

Is this possible?
When troubleshooting MSDE install issues in the field, verbose logging
would be nice. But as far as I can see, it's only available for the
standalone install (/L*v), and not the merge module.
I should probably add that my install is an Installshield-based
compressed setup.exe.

Merge Metadata Cleanup

Hello Everyone,
I am running SQL2K with sp3, continuous merge replication, only one
subscriber.
My current setting for subscription expiration is 7 days; I have just
recently changed this from the default of 14 days, as our subscriber is
basically always available.
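For reference, that retention setting is changed with sp_changemergepublication; a sketch, assuming a hypothetical publication named 'MyMergePub':

-- Shorten the merge metadata retention period to 7 days.
EXEC sp_changemergepublication
    @publication = N'MyMergePub',
    @property = N'retention',
    @value = N'7';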
I am still trying to figure out how to clean up my merge metadata. From
what I have read and from what I understand,
sp_mergemetadataretentioncleanup is automatically run if the agent
running merge has the MetadataRetentionCleanup parameter set to 1.
I am under the assumption that with MetadataRetentionCleanup set to 1
it should automatically be cleaning
up my metadata. Even though I have it set to 1, it does not appear
like it is actually cleaning out MSmerge_contents or MSmerge_tombstone.
These tables just keep growing.
Now if I had this parameter set to 0 then I would need to use
sp_mergecleanupmetadata to clean up MSmerge_contents or
MSmerge_tombstone.
Would there ever be a reason for running both?
I did go to BOL but I read the following about using
sp_mergecleanupmetadata:
If you want to run sp_mergecleanupmetadata without the subscriptions
being marked for reinitialization:
Stop all updates to the publication and subscription databases.
Unfortunately, I cannot stop all updates as our servers need to be
available 24/7.
Does anyone run this without stopping replication first? And if you do
run it, do you have MetadataRetentionCleanup set to 0?
If anyone can help I truly would appreciate it.
Thanks,
Barbara
*** Sent via Developersdex http://www.codecomments.com ***
Are you using anonymous subscribers? Merge replication does not clean
up the metadata as efficiently with anonymous subscriptions as opposed
to named subscriptions.
|||Thanks Hilary.
Unfortunately no, we do not have any anonymous subscriptions, only one
named subscription.
Any other ideas?
Barbara
*** Sent via Developersdex http://www.codecomments.com ***
|||I believe I have my answer:
1) sp_mergecleanupmetadata
2) sp_mergemetadataretentioncleanup
The first should be used for topologies with SQL Server 2000 SP1 or less
(which is not my case).
The second is the one to use to control the process manually for SQL Server
later than SP1 (which is my case), and when MetadataRetentionCleanup is set
to 0 (not automatic).
There is also a catch, though: the second sp is executed automatically by
the Merge Agent at startup when MetadataRetentionCleanup is set to 1;
therefore, when running in continuous mode it will be called only once.
So I'll need to manually run sp_mergemetadataretentioncleanup or set up a
job that will do it for me on a scheduled basis.
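A sketch of what such a scheduled job step might run (the variable names are mine; the output parameters are part of sp_mergemetadataretentioncleanup's documented signature):

-- Run retention-based metadata cleanup manually and report how many
-- rows were removed from the merge system tables.
DECLARE @genhistory int, @contents int, @tombstone int
EXEC sp_mergemetadataretentioncleanup
    @num_genhistory_rows = @genhistory OUTPUT,
    @num_contents_rows = @contents OUTPUT,
    @num_tombstone_rows = @tombstone OUTPUT
SELECT @genhistory AS genhistory_rows,
       @contents AS contents_rows,
       @tombstone AS tombstone_rows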
Thanks for your help Hilary.
Barbara
*** Sent via Developersdex http://www.codecomments.com ***

Merge Large Tables

What is the best way to merge two really large tables (7,000,000 rows each,
and very wide)?
The table definitions are the same for both tables.
Right now I am using an INSERT INTO statement with a selection from one table
at a time, but it takes way too long. I don't need it to be logged. An all
or nothing result is fine for me. I don't think DTS is an option because I
need to run this from a C# app. I think my only other option is using the
BCP api to select the data and load it into the new table, but this just
seems like the wrong way to go.
Any other ways to go with this?
Shawn
SELECT *
INTO [NewTable]
FROM (
    SELECT * FROM [Table1]
    UNION ALL
    SELECT * FROM [Table2]
) AS Combined
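One note on this suggestion: SELECT ... INTO can be minimally logged under the simple or bulk-logged recovery model, which matches the "I don't need it to be logged" requirement. If [NewTable] must already exist, a rough equivalent (assuming identical column lists; note this form is fully logged on SQL Server 2000) would be:

INSERT INTO [NewTable] WITH (TABLOCK)
SELECT * FROM [Table1]
UNION ALL
SELECT * FROM [Table2]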
"Shawn Meyer" <me@.me.com> wrote in message
news:O4GhS6QQFHA.580@.TK2MSFTNGP15.phx.gbl...
> What is the best way to merge two really large tables (7,000,000 rows
> each,
> and very wide)?
> The table definitions are the same for both tables.
> Right now I am using a insert into statement with a selection from one
> table
> at a time, but it takes way too long. I don't need it to be logged. A all
> or nothing result is fine for me. I don't think DTS is an option because
> I
> need to run this from a C# app. I think my only other option is using the
> BCP api to select the data and load it into the new table, but this just
> seems like the wrong way to go.
> Any other ways to go with this?
> Shawn
>


Merge Join's poor performance

Hello All,

I'm experiencing performance problems with the merge join task.
Every time I build a nice package using this task, I end up deleting it and using a SQL statement in the OLE DB Source to accomplish the join, since it takes forever to run and crushes my computer in the process.
It makes me feel like I'm not using the abilities SSIS has to offer compared to DTS.
Of course, for several thousand records it works fine, but in a production environment with hundreds of thousands of rows, it seems to be futile.

Maybe someone had a little more luck with it?

If I were you I would use the OLE DB Source component to do the join - there is absolutely nothing wrong with doing that. If you have a super-performant relational database engine at your disposal - why not use it?
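As a sketch of what "doing the join in the source" means here (the table and column names are invented for illustration), the OLE DB Source's SQL command would simply contain the join, so the database engine does the heavy lifting and SSIS just receives the joined rows:

-- The relational engine performs the join before rows enter the pipeline.
SELECT o.OrderID, o.OrderDate, c.CityName
FROM dbo.Orders AS o
INNER JOIN dbo.Cities AS c
    ON c.CityCode = o.CityCode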

Donald Farmer talks around this a little in his OVAL webcast. If you only watch one SSIS webcast in your life then it should be this one.

Donald Farmer's Technet webcast
(http://blogs.conchango.com/jamiethomson/archive/2006/06/14/4076.aspx)

-Jamie

|||Thanks, I'll take a look. Does the merge join take place in the cache?|||

Can you elaborate as to what you mean by "the cache"?

-Jamie

|||

Sure. When I'm using a Lookup component to make the join, the lookup table is cached by default. When I want to join to a large table (with several million rows) I understood it's best practice to use the Merge Join component, but as I see it, this component also caches the records in the machine's memory, so I don't understand the benefit...

|||

There's a different sort of memory usage going on here.

Yes, the LOOKUP has an area of memory that we call the cache and is used for storing a lookup set.

MERGE JOIN stores data in memory (as do all asynchronous components) but it's a different kind of memory and we don't refer to it as a cache. It is just the component's working area - more commonly termed a buffer. Also, this working area will change as MERGE JOIN does its work, whereas the LOOKUP cache is static.

It is also worth saying that MERGE JOIN can spool data to disk if it's running out of memory. LOOKUP cannot do that with its cache.

-Jamie

|||

Hi SSIS friend,

I remember asking a similar question a couple of months back. I think I had 15+ Merge Join components in a test package and the performance was awful. Jamie's advice then and now is pretty sound.

During my short period experimenting with SSIS, I came to realise that in order to create efficient packages, I had to utilise the power of both the SQL Server and SSIS engines. Each one is good at performing certain tasks better than the other. It takes time and I'm still learning, but the more you play around with it, the easier it gets to choose which engine should be used.

|||

Thanks guys.


Merge join: nr of output rows unchanged when amount of input changes

Dear all,

I created a package that seems to work fine with a small amount of data. When I run the package with more data (as in production), however, the merge join output is limited to 9963 rows, no matter if I change the number of input rows.

Situation as follows.

The package has 2 OLE DB Sources, in which SQL-statements have been defined in order to retrieve the data.

The flow of source 1 is: retrieving source data -> trimming (non-key) columns -> sorting on the key-columns.

The flow of source 2 is: retrieving source data -> deriving 2 new columns -> aggregating the data to the level of source 1 -> sorting on the key columns.

Then both flows are merged and other steps are performed.

If I test with just a couple of rows it works fine. But when I change the where-clause in the data source retrieval, so that the number of rows is for instance 15000 or 150000, the number of rows after the merge join is still 9963.

When I run the package in debug-mode the step is colored green, nevertheless an error is displayed:

Error: 0xC0047022 at Data Flow Task, DTS.Pipeline: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "Merge Join" (4703) failed with error code 0xC0047020. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.

To be honest, a few more error messages appear, but they don't seem related to this issue. The package stops running after some 6000 rows have been written to the destination.

Any help will be greatly appreciated.

Kind regards,

Albert.

If you could post the full error output here, that would probably be helpful. Sometimes it is the "big view" that helps point you at the cause of the problem, especially since many of SSIS's error messages are not particularly transparent. The earlier errors often show what triggered the later errors, even if they do not appear directly related.

|||Can you also include details of the next task in the pipeline, the one that accepts the ~9000 rows. It sounds like it's failing on the first buffer it receives as input.|||

Ok, for the big picture: I feel like a fool.

Solving one of the other errors solved the reported issue as well.

Apologies for bothering you.


Merge join with empty inputs

Hi all,

Does anyone have suggestions for ways to deal with the chance that a merge join might receive empty inputs?

I've noticed that when this happens the transformation seems to hang. I changed the MaxBuffersPerInput to zero and this seems to cure the problem but I'm not sure it's the best way to deal with it.

Would it be a good idea to test the row counts with a conditional split before such a join?

Cheers,

Andrew

Are you certain that both inputs are generating EORs (end-of-rowset signals) when they are complete? If so, then this is likely a bug. If not, then there is no way for the Merge Join to know that it won't be getting buffers, so it is not hung; it is just waiting for pending input.

HTH,

Matt

|||

I'm pretty sure they did when I ran the package in a cmd window.

Presumably they did after I changed the MaxBuffersPerInput to fix the problem and the data hadn't changed between runs.

I'll need to recreate the state to confirm this which will take a while.

Andrew

|||

Finally got the problem to recur.

In this case it's a simple Inner Join on one column. The left input has no rows coming in, the right has 582 as shown below.

DataFlow: 2006-07-11 16:06:50.97
Source: DFT Get recent returns from Oracle
Component "MRGJ Match on Oracle Ids" (1626) was given end of rowset on input "Merge Join Left Input" (1627)
End DataFlow
DataFlow: 2006-07-11 16:06:50.97
Source: DFT Get recent returns from Oracle
Component "MRGJ Match on Oracle Ids" (1626) will receive 582 rows on input "Merge Join Right Input" (1628)
End DataFlow
DataFlow: 2006-07-11 16:06:51.10
Source: DFT Get recent returns from Oracle
Component "MRGJ Match on Oracle Ids" (1626) was given end of rowset on input "Merge Join Right Input" (1628)
End DataFlow

The package runs in BIDS but hangs with the above as the last output when run through DTEXEC. I also tried switching the left/right inputs just in case but the result was the same.

Any suggestions for a clean way to handle this? I'm trying to work out a consistent way to handle merge joins to deal with this sort of situation.

Andrew

|||

Looks like this may have been resolved by a hotfix:

FIX: When you call a SQL Server 2005 Integration Services package from a SQL Server Agent job step, the package may stop responding

CAUSE: This problem occurs when the SSIS package contains a Merge Join transformation.

http://support.microsoft.com/default.aspx?scid=kb%3Ben-us%3B928243&sd=rss&spid=2855

|||

Did you install the hotfix? Or did you continue to use the 0 MaxBuffersPerInput property? I just discovered this issue in one of my packages and I'm curious whether that hotfix fixed the problem. The description on the hotfix is really vague.


Merge Join vs. Lookup vs. Custom Script - which is fastest?

Very often we have 50'000+ rows for which you need to pull values from a different source (e.g. CityNames from citycode in an Excel file). Currently we are using Lookup - but the question is which of those 3 options is best performance-wise?

Have you run your own tests?

Take your source and throw it through each of the above options and finally into a row counter. Compare the time it takes to get through the whole dataflow.|||

Only you can answer that question. Test and measure, test and measure, test and measure.

-Jamie

|||Thanks guys for your answers - I will try it out performance testing this weekend on my free time.
So far it has seemed to me that merge join is slower than lookup; however, lookup seems to take much longer than I'd like it to - so I was wondering if creating a script transform would be a better solution... Just wanted to get an idea - based on your experience, which option do you use?|||

OK. Well I am loath to give my opinions on performance comparisons but I'd lay a lot of money to say that the script transform will be slowest.

-Jamie

|||

If you do test the three methods, please post the results here. I am using custom script for lookups (small reference lists but millions of source rows in pipeline) but I would like to know how large reference lists perform.

|||For those posting to this thread and reading it, please watch the Webcast presented by Donald Farmer on performance and scale in SSIS. In there Donald talks about benchmarking and how to set up SSIS to obtain timings associated with different aspects of a package.

TechNet Webcast: SQL Server 2005 Integration Services: Performance and Scale (Level 400)
MS TechNet Event ID: 1032298087

I don't know if this link will work for anyone:

https://msevents.microsoft.com/CUI/Register.aspx?culture=en-US&EventID=1032298087&CountryCode=US&IsRedirect=false|||Where can I report a BUG about this forum - I have a 50/50 chance that when I try to create a hyperlink in my post, Firefox crashes.
(Thank god I copied and pasted the following post to notepad before "doing the hyperlink trick")|||So here we go:
I'm running the tests on my workstation WinXP, 2.93GHz, 2.5gb ram.
The DB is accessed over the LAN.

Test1 (Lookup):
Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
The Lookup is a query accessing a table of 248250 records, pulling 61280 records and about 25 columns
2 outputs - Listing Found (56523 rows) and Error Listing Not found (118990 rows)
Also lookup is Full Cache mode and gives Warning: found duplicate key values.
Result:
Finished, 4:11:00 PM, Elapsed time: 00:00:15.437
Note: Memory usage of PC peaked at 1.8GB with CPU usage jumping to 100% once.

Test 2 (Merge Join):
1st Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
2nd source: OLE DB Source with a query accessing a table of 248250 records, pulling 61280 records and about 25 columns, with ORDER BY ID. Output is marked sorted by the ID column.
1st source is Sorted using "Sort transform".
Then "Merge Joined" with ole db via Left outer join (Sort on left)
Then "Conditional Split" based on ISNULL(oledbsource.ID)
Result:
Finished, 4:49:33 PM, Elapsed time: 00:01:14.235
Note: Memory usage of PC peaked at 2.6GB with CPU usage jumping to 100% twice.

Test3 (Script Transform) -
Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
Script transform to do a lookup based on key column for each row in pipeline.
Result:
Cancelled after 30 minutes of processing - during which it had processed 11547 records (out of 175513)
Note: Memory usage was stable around 1GB and CPU near 5% usage

My Conclusion:
Although I was concerned with the performance of the lookup transform - for testing whether data is to be inserted or updated - it seems that's not the culprit - the root of evil seems to be the OLE DB update command and the OLE DB Destination (at the moment we are using a SQL 2000 db - upgrading to 2005 soon).
Although the Script transform consumed the least amount of machine resources, executing 100K+ SQL queries against the db will take too long.
Although the merge join elapsed time is not bad, resource usage and 3 more steps than lookup are negatives.
So I think next weekend's performance testing will be how to make faster INSERTs/UPDATEs to the DB.
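For what it's worth, the usual cure for a slow row-by-row OLE DB update command is to bulk-load the changed rows into a staging table and then issue one set-based statement; a sketch (the staging table stgCustomer is hypothetical):

-- One set-based UPDATE instead of one round trip per pipeline row.
UPDATE c
SET c.CustomerName = s.CustomerName
FROM dbo.tblCustomer AS c
INNER JOIN dbo.stgCustomer AS s
    ON s.KeyCustomer = c.KeyCustomer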

Test 1 & 2 are based on Jamie Thomson article - http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx
Test 3 is based on Greg Van Mullem article - http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm|||

Excellent stuff. This is really valuable information. Thank you. I've updated my post with a link to here.

|||

Yes, thanks for posting very interesting info. Today I am going to change all my script lookups to use the StringBuilder class and methods (strongly recommended in all the .NET literature where performance matters when modifying strings). Currently all my lookup script transforms use object-based .NET string variables, which are notoriously terrible performers when the string values are repeatedly modified. Do you know which approach your script transform used? (assuming you are creating and modifying string variables in your lookup script)...

If I detect the same low processor usage in my script lookups I may also try and partition the pipeline to get a lookup to run with multiple threads...

Ken

|||My script does a lookup similar to what is described in the aforementioned Van Mullem article:

Public Overrides Sub PreExecute()
    ' Prepare the parameterized lookup query once, before rows start flowing.
    sqlCmd = New SqlCommand("SELECT KeyCustomer, CustomerName FROM tblCustomer WHERE (KeyCustomer = @KeyCustomer)", sqlConn)
    sqlParam = New SqlParameter("@KeyCustomer", SqlDbType.Int)
    sqlCmd.Parameters.Add(sqlParam)
End Sub

Public Overrides Sub CustomerRecordsInput_ProcessInputRow(ByVal Row As CustomerRecordsInputBuffer)
    ' Runs once per pipeline row: query the database to decide whether the
    ' customer already exists (update) or is new (insert).
    Dim reader As SqlDataReader
    sqlCmd.Parameters("@KeyCustomer").Value = Row.CUNO
    reader = sqlCmd.ExecuteReader()
    If reader.Read() Then
        Row.DirectRowToUpdateRecordsOutput()
    Else
        Row.DirectRowToInsertRecordsOutput()
    End If
    reader.Close()
End Sub

|||Ken - is your script performing a lookup from another source in the pipeline?
<boy i'd like to know how to do that>

Also - any suggestions how to improve performance of OLE DB Update command?|||

Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.

A non-cached lookup will be extremely slow, as was your script component.

|||

Crispin wrote:

Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.

A non-cached lookup will be extremely slow, as was your script component.

It would be best to try to replicate full caching in the script component. The purpose of the exercise was to see which was faster. So, we know how fast (and legitimately so) the lookup component was, now how fast can we get the script component to process?

The question is how fast can each of the elements process their data, not how slow can we make them work.

Phil


Merge Join Unexpected Results

I noticed a possible bug with the merge join in the June CTP and wanted to see if anyone else is experiencing these issues.

I have two sorted OLE DB sources with a sort order on columns TicketNumber (1), SequenceNumber (2). Later in my transformation I want to apply a merge join only on TicketNumber (1). The metadata still shows my sort order of TicketNumber, SequenceNumber, but my join condition only uses the TicketNumber column.

Upon execution, the first sequence of the ticket on the left input joins to every ticket sequence on the right. This is the expected behavior. However, once the sequence number on the left changes and the ticket number remains the same, the left row joins to nothing on the right.

In order to get my expected result I had to re-sort the data on only the TicketNumber (1) column to clear the sort-order metadata. Below is an example of my results and expected results.

Just wondering what others are experiencing and whether this is a potential bug or not. The redundant sort is killing my performance, but I need the two-column sort for work done in previous steps.

Thanks,

Adam

Example.

Left Table
Ticket,Seq,Data
1,1,A
1,2,B
1,3,A

Right Table
Ticket, Seq, Data
1,1,A
1,2,B
1,3,A

Joined Result
L.Ticket, L.Seq, L.Data, R.Ticket, R.Seq, R.Data
1, 1, A, 1, 1, A
1, 1, A, 1, 2, B
1, 1, A, 1, 3, A
1, 2, B, NULL, NULL, NULL
1, 3, A, NULL, NULL, NULL

Expected Result
L.Ticket, L.Seq, L.Data, R.Ticket, R.Seq, R.Data
1, 1, A, 1, 1, A
1, 1, A, 1, 2, B
1, 1, A, 1, 3, A
1, 2, B, 1, 1, A
1, 2, B, 1, 2, B
1, 2, B, 1, 3, A
1, 3, A, 1, 1, A
1, 3, A, 1, 2, B
1, 3, A, 1, 3, A

I have seen this problem before. If I remember correctly, you have to make sure only the columns that you want to join on are marked as sorted. It seems as though the merge join pays more attention to which columns are sorted than to the "Join Key" checkboxes.
I filed a bug on this and this is the reply:
You can either uncheck the joinkey column in the custom editor or you can change the NumKeyColumns property in the properties grid or advanced editor.
I haven't had time to test the NumKeyColumns property, let me know if it works.
-Evan Black|||Evan is right.

Adam, what's the value of NumKeyColumns (a property on the Merge Join transform)? If you set it to 1, you will get the expected results. That worked for me.|||Thanks guys, that did the trick.

Merge Join transformation hangs

I have searched through this forum, and I could not find the solution.

My SSIS data flow is a simple one: extract two tables from Oracle using OLE DB Sources, join them together using a Merge Join, and load them into a table in SQL Server via a SQL Server Destination.

I can't get through this simple procedure because the Merge Join always hangs. On further investigation, I found one input (randomly one of the two) is always stuck. Sometimes the input is empty, sometimes it is about half way through.

Is there a workaround to see what is happening there and to fix this problem?

TIA

As far as I can see, Merge Join is not the culprit; it is by design for Merge Join to wait until it gets buffers from both inputs before it outputs. It looks to me like your input(s) never arrived from your Oracle server. I suggest you use an external tool (RowsetViewer?) to run the same queries you put on the two OLE DB Source inputs, to see whether they work there.

thanks

wenyang

|||

Also make sure the 2 inputs of the Merge Join are actually sorted.

Rafael Salas

|||

Hi, Wenyang & Rafael,

Thanks very much for your input.

The 2 inputs of the Merge Join are sorted by "order by" and IsSorted = true. I found the killer was the "data viewer" on the path, which shows the input data. Once I removed the data viewer, the problem was gone. I cannot explain why.

BTW, what's the external tool (rowsetviewer)? Where can I get it?

Merge Join question

Hi, All,

In the data flow of Integration Services, my understanding is that the "Merge Join" only does the "join"; there is no way to specify a "where" clause. If I need a "where" clause or other selection criteria, how should I do it in Integration Services?

TIA,

Maybe put in a conditional split transformation before you go into the merge... Filter your rows there.|||

john zhang wrote:

Hi, All,

In the data flow of Integration Services, my understanding is that the "Merge Join" only does the "join"; there is no way to specify a "where" clause. If I need a "where" clause or other selection criteria, how should I do it in Integration Services?

TIA,

Depending on the source you are using, I would use a SQL statement right in the source components to 'pull' only the required rows; otherwise I would use conditional split transforms within the data flow.
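
For example, instead of picking the table from the dropdown, set the source's data access mode to a SQL command along these lines (table and column names made up for illustration):

SELECT OrderID, CustomerID, Amount
FROM dbo.Orders
WHERE OrderStatus = 'OPEN'  -- the "where" happens at the source, before the data flow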

Rafael Salas

|||Thanks Phil.|||

Thanks Rafael for the input. I now have "conditional split" working for me, even if it's inconvenient

|||

john zhang wrote:

Thanks Rafael for the input. I now have "conditional split" working for me, even if it's inconvenient

Why is it inconvenient?

-Jamie

|||

With other popular ETL tools, you can do both the "join" and the "where" in one transformation. But in SQL Server 2005 Integration Services, you have to use two transformations to finish the task.

|||

john zhang wrote:

With other popular ETL tools, you can do both the "join" and the "where" in one transformation. But in SQL Server 2005 Integration Services, you have to use two transformations to finish the task.

Why is that a bad thing? Is having 2 boxes on the design surface really such a bad thing?

SSIS components are deliberately "atomic". By that I mean they each do one single operation, and thus the aim is to build data flows which are greater than the sum of their parts. Just because the "join" and the "where" are in the same box doesn't mean it's doing any less work. In fact it might actually be doing more, because one component is using a CPU to do two operations, whereas in SSIS this may not be the case.

Please don't think I'm trying to say that you're wrong - this is just my opinion and in a way I'm playing devil's advocate. I think it's a useful discussion to have.

-Jamie

|||

Here's a discussion point: the merge join would be even better if it could join more than 2 tables at a time.

I have many packages that require a left join to 5 or 6 other tables, and the only way to do this is to use a merge join for the first two tables, then get the data from the 3rd table, add another merge join transform, then join the output of the first merge join to the data from the third table to get a 3-table join, and so on...

It works ok though..

|||

bobbins wrote:

Here's a discussion point: the merge join would be even better if it could join more than 2 tables at a time.

I have many packages that require a left join to 5 or 6 other tables, and the only way to do this is to use a merge join for the first two tables, then get the data from the 3rd table, add another merge join transform, then join the output of the first merge join to the data from the third table to get a 3-table join, and so on...

It works ok though..

Nice idea. Submit it at http://connect.microsoft.com

-Jamie


Merge Join Problem

Hello,
I've got a problem with the merge join operator.

I'm trying to Join two tables using a left outer join.

Here are my tables :

TABLE AGENT
AGENT_CODE; REGION_ID; COUNTRY_ID
AA;01;01;
BB;01;01;
CC;01;02;
CC;02;02;
DD;01;01;

600 records in this table.

TABLE DELEGUE
AGENT_CODE; FIRST_NAME; LAST_NAME
AA; Maradona;Diego;
BB; Maradona;Diego;
DD; Zidane;Zinedine;

145 records in this table.

This is what I expect from my left outer join:
AGENT_CODE; REGION_ID; COUNTRY_ID; FIRST_NAME; LAST_NAME
AA;01;01;Maradona;Diego;
BB;01;01;Maradona;Diego;
CC;01;02;NULL;NULL;
CC;02;02;NULL;NULL;
DD;01;01;Zidane;Zinedine;

600 records in the destination table, with only 145 non-null FIRST_NAME and LAST_NAME values.

The IsSorted property has been set to true on the 2 source tables, the NumKeyColumns property is set to 1, and the join keys are both checked.
When I execute the SQL query using the editor in SQL Server, the result is good - I get what I expected.
But when I run the data flow, the destination gets 600 records, but the FIRST_NAME and LAST_NAME columns are non-null only 8 times; the other 592 rows are NULL.
Could someone explain why I lost 145-8=137 FIRST_NAME and 145-8=137 LAST_NAME values?
Thank you in advance.

With a left join, since the table on the left has 600 records, it must return 600 rows, I think.

Can you also post your query here?

You can also check with a subquery exactly how many rows match in your child table.

|||I changed the table names in the first post so that anyone could understand (I'm French); the real table names are:
CODE_AGENT (600 records) and DELEGUE (145).
I want to join these two tables using CODE_AGENT.AGENT_ID and DELEGUE.CODE_AGENT.

This is the query I made:
SELECT DELEGUE.NOM_DELEGUE, DELEGUE.PRENOM_DELEGUE, CODE_AGENT.NOM_AGENT, CODE_AGENT.CODE_SECTEUR,
CODE_AGENT.CODE_EQUIPE, CODE_AGENT.CODE_USINE, CODE_AGENT.AGENT_ID, CODE_AGENT.REGION_ID
FROM CODE_AGENT LEFT OUTER JOIN
DELEGUE ON CODE_AGENT.AGENT_ID = DELEGUE.CODE_AGENT

The result is exactly what I want, but when I use SSIS, I lose many values of DELEGUE.NOM_DELEGUE and DELEGUE.PRENOM_DELEGUE.|||

You said that the IsSorted property is set to true and the SortKeyPosition property is set to 1, but you didn't say that the data is actually sorted. Can you verify that it is? Setting these properties doesn't do anything except tell the data flow that the data is sorted. If it actually is not sorted, the results of your Merge Join are undefined.
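
For example, if your sources are queries, the ORDER BY has to match the sort keys you declared - something like this for one of your inputs (column list abbreviated from your post):

SELECT AGENT_ID, REGION_ID, NOM_AGENT
FROM CODE_AGENT
ORDER BY AGENT_ID

Otherwise, place a Sort transform after each source instead.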

Thanks,

Matt

|||Thank you Matt!
That was the problem - I didn't understand that I had to sort the data for each merge join input.
I added a "Sort" data flow transformation and the result is now correct!

Merge Join Output Bug?

I've run into something that looks like a bug to me but I wanted to run it by the board:

Merge join 2 sorted tables.

Table1: ColumnA : Sort Order 1, ColumnB Sort Order 2

Table2 : ColumnA: Sort Order 1, ColumnB Sort Order 2, ColumnC not sorted

Merge Join the two tables on ColumnA and ColumnB...

Choose the following as output columns

A + B + C = works

C = works

A + C = works

B + C = does NOT work... error message: The column with the SortKeyPosition value of 0 is not valid. It should be 2.

Basically, if you choose one or more of the sorted columns in the output, at least one of them has to be the column with sort position 1, or you'll get that error.

Is this a bug or intentional? If you do not have sort column 1 in the output, that output can no longer be considered sorted... so perhaps the error is related to that (instead of an error I'd expect some warning about the sorting). Interesting that it lets you choose C only, because that also makes the output unsorted.

I see your point, Chris.

I think it is intentional -

. The reason B+C does not work is that column B has a non-zero SortKeyPosition, which indicates the output (to which B belongs) should be sorted (in other words, the output's IsSorted property is true), but the output cannot find a column with SortKeyPosition 1.

. As for why C alone works: the output is then simply not sorted.

If you think the error message is not very helpful, please log a customer issue through our Connect website http://connect.microsoft.com/SQLServer and your request will be addressed as appropriate.

Thanks

wenyang

Merge Join or Union All

Hi,

I am trying to normalize data using the unpivot transform. I have to unpivot using more than one key, so I have a multicast feeding into two unpivot transforms, then into a sort transform. This is where my problem starts - I have tried using a Merge Join (inner join) transform but don't get the expected result.

My original data looks like this:

Pk_ID, Choice1, Choice2, Feedback1, Feedback2
10, a, b, x, y

After the multicast - unpivot - Merge Join, the expected result is (Pk_newID is an identity):

Pk_newID, fk_ID, Choice, Feedback
563, 10, a, x
564, 10, b, y

However, with a Merge Join (inner join on Pk_ID) I get:

Pk_newID, fk_ID, Choice, Feedback
563, 10, a, x
564, 10, a, y
565, 10, b, x
566, 10, b, y

Is the Merge Join transform not the right choice?

Thanks

Which columns are you using for the join in the Merge Join component? If you use only fk_ID in your join, then the output, even though it is not what you expected, is perfectly valid: joining on fk_ID alone pairs every Choice row with every Feedback row for that key, which is exactly the four-row cross product you got.

It seems like you need something else for your join... or maybe the unpivot is not the right option in this case.

Rafael Salas

|||I think you are going to perhaps have to do some conditional logic as was done in your other thread. Take a look at it from a different perspective as I did in my example and perhaps you can get to where you need to be...

Merge join not matching "correctly"

And by correctly, I mean the way it *should* match, of course.

I've got 2 data sources, using a left outer join, matching 2 columns. Whatever is on the right side of my join never matches and returns data. Here's my basic setup:

OLE DB Source 1 (table1)
Sel Name Order JoinKey
no UniqueID 1 yes
yes Column1 0 no
...

OLE DB Source 2 (table2)
Sel Name Order JoinKey
yes ID 1 yes
yes Columns 0 no
...

There is a link (arrow) between UniqueID and ID, and the join type is "Left Outer Join". When I execute the statement "SELECT * FROM Table1 LEFT JOIN TABLE2 ON TABLE1.UniqueID = Table2.ID", the data returns correctly to me. What am I missing with the properties I've set above with the merge join?


some more information... my two queries are from variables...

"select * from Table1 ORDER BY UniqueID ASC"
and
"select * from Table2 ORDER BY ID ASC"

|||

Are the join columns the same data type?

I set up a similar scenario, and it's working fine. It looks like you've set your SortKeys, etc, so it should work.

Have you added data viewers before the merge join to verify the data is what you expect?

|||Just a general observation: it seems very odd to only be selecting the key from the right side in a left join. Maybe you have a reason for doing it that way, but as a troubleshooting step I'd swap it to the left side to see if that changed anything.

I don't know what data types your ID columns are or if this even applies, but the SSIS components perform case-sensitive matches, while SQL Server's default collation is case-insensitive. This allows things that match in SQL to not match in SSIS.
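
If they are character columns, a quick way to test that theory is to fold the case in both source queries, e.g. using the names from your post:

SELECT UPPER(UniqueID) AS UniqueID, Column1 FROM Table1 ORDER BY UniqueID
SELECT UPPER(ID) AS ID, Columns FROM Table2 ORDER BY ID

If the merge join starts matching after that, the collation difference was the problem.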
|||

papalarge wrote:

And by correctly, I mean the way it *should* match, of course.

I've got 2 data sources, using a left outer join, matching 2 columns. Whatever is on the right side of my join never matches and returns data. Here's my basic setup:

OLE DB Source 1 (table1)
Sel Name Order JoinKey
no UniqueID 1 yes
yes Column1 0 no
...

OLE DB Source 2 (table2)
Sel Name Order JoinKey
yes ID 1 yes
yes Columns 0 no
...

There is a link (arrow) between UniqueID and ID, and the join type is "Left Outer Join". When I execute the statement "SELECT * FROM Table1 LEFT JOIN TABLE2 ON TABLE1.UniqueID = Table2.ID", the data returns correctly to me. What am I missing with the properties I've set above with the merge join?


Papa,

On top of the suggestions above:

Are the OLE DB Source components connected directly to the Merge Join, or are there other transforms in between? The latter could alter the original order of the rows.

Have you tried placing a Sort transform before each Merge Join input? I know you are using an ORDER BY in the queries, but just to test it.

BTW,

Why are you using Merge Join at all?

I noticed in your original post that you can run a query where you actually join the 2 tables in a single query. Did you know you could place the same query in a single OLE DB Source component? That would give you better performance and less trouble. If a single query is not an option, what about having a one-table OLE DB Source component that feeds a Lookup transform to add the columns from the 2nd table?

|||Can't believe I didn't think of that. Yeah, I just ended up doing the join in SQL, and not through the IDE. Thanks for the help, all... I still don't know exactly why the merge results weren't connecting. I guess the next step was going to be getting a data viewer working so I could see exactly what was being returned.

Thanks again...