Friday, March 30, 2012

Merge join more than 2 tables

I need to take certain items of data from four different tables and out them into one table.

Unfortunately my source data's version of SQL does not support the LEFT JOIN keyword which has left me with a bit of a problem.

I saw the merge join in SSIS and used it to get data from two of the tables and stick them in the destination and it all worked fine.

That got me thinking, is it possible to create a second merge join transformation within the same data flow task for the remaining two tables and then join the output of both the merge joins to give me the data I need from all four tables in one output?

I cannot answer your direct question about the merge join, but wanted to make one comment. It is possible to recast a Left Join query as a Union and this might be a viable approach for your problem. The general form is to make a Union of the simple Join and a Not IN query.

Merge Join gives unexpected results when the input has speacial characters

Hi
I have a strange issue here.
Both the input datasets of the merge join are sorted . Everything works fine if the dataset doesn't contain speacial characters like - (hipen).
This is the sorted dataset.
ASI311_3
ASI311_3
ASI312
ASI311
ASIKit1
ASIKit1

Merge Joiner has to look for ASI311 and finds it in the list and everything works as expected. But if the dataset is changed to

ASI311_3
ASI311_3
ASI-312
ASI311
ASIKit1
ASIKit1

Then it doesn't find the ASI311 in the above dataset ?
Only change between the two datasets is the - (hipen) which is added in one of the field. Does SSIS has a differenent precedence of the special characters than that of the Database. Both the datasets are on SQLServer.

Can anyone point out the issue here?

Thankyou
Vidya

What is the sort column's data type (in SQL server, not SSIS) and what is its collation?

The issue is highly like one of collation. Hyphen sorts to the top in SQL collations , and to the bottom in Windows collations. If your SQL server column is using a SQL collation, like SQL_Latin_General1_CI_AS, then an "ORDER BY" on the column of SQL collation would not of necessity match an SSIS sort (or a merge join which requires data of a particular collation).

If so, its likely you have a database collation equal or similar to SQL_Latin_General1_CI_AS on the database columns. Data with hyphens present will sort differently in SQL than in SSIS using the aforementioned SQL collation.

You have some options though.

1. Change the collation on the database column from a SQL collation to a windows collation. From example you would use Latin_General1_CI_AS rather than SQL_Latin_General_CI_AS.

2. Change the column data type to the national character variant, e.g. from varchar to nvarchar.

3. Lastly, you can change collation in the source components's "select" statement itself, as in "select x from t order by x COLLATE Latin_General1_CI_AS". This sort order is a windows collation compatible with SSIS.

Now, SSIS is also case-sensitive, whereas SQL server (by default) is not. However, that does not appear to be your issue here, since there is no "captailized" hyphen.

|||
Thanks a lot jadge , it worked Smile
I changed the collation from SQL to Windows as mentioned and it worked .

Thankyou
Vidya
|||Hi,
What is the way to set the same type of property in Oracle 10g database.
Do we need to use NLS_SORT property ?
Can anyone point me to that.

Thankyou
Vidya

Merge Join gives unexpected results when the input has speacial characters

Hi
I have a strange issue here.
Both the input datasets of the merge join are sorted . Everything works fine if the dataset doesn't contain speacial characters like - (hipen).
This is the sorted dataset.
ASI311_3
ASI311_3
ASI312
ASI311
ASIKit1
ASIKit1

Merge Joiner has to look for ASI311 and finds it in the list and everything works as expected. But if the dataset is changed to

ASI311_3
ASI311_3
ASI-312
ASI311
ASIKit1
ASIKit1

Then it doesn't find the ASI311 in the above dataset ?
Only change between the two datasets is the - (hipen) which is added in one of the field. Does SSIS has a differenent precedence of the special characters than that of the Database. Both the datasets are on SQLServer.

Can anyone point out the issue here?

Thankyou
Vidya

What is the sort column's data type (in SQL server, not SSIS) and what is its collation?

The issue is highly like one of collation. Hyphen sorts to the top in SQL collations , and to the bottom in Windows collations. If your SQL server column is using a SQL collation, like SQL_Latin_General1_CI_AS, then an "ORDER BY" on the column of SQL collation would not of necessity match an SSIS sort (or a merge join which requires data of a particular collation).

If so, its likely you have a database collation equal or similar to SQL_Latin_General1_CI_AS on the database columns. Data with hyphens present will sort differently in SQL than in SSIS using the aforementioned SQL collation.

You have some options though.

1. Change the collation on the database column from a SQL collation to a windows collation. From example you would use Latin_General1_CI_AS rather than SQL_Latin_General_CI_AS.

2. Change the column data type to the national character variant, e.g. from varchar to nvarchar.

3. Lastly, you can change collation in the source components's "select" statement itself, as in "select x from t order by x COLLATE Latin_General1_CI_AS". This sort order is a windows collation compatible with SSIS.

Now, SSIS is also case-sensitive, whereas SQL server (by default) is not. However, that does not appear to be your issue here, since there is no "captailized" hyphen.

|||
Thanks a lot jadge , it worked Smile
I changed the collation from SQL to Windows as mentioned and it worked .

Thankyou
Vidya
|||Hi,
What is the way to set the same type of property in Oracle 10g database.
Do we need to use NLS_SORT property ?
Can anyone point me to that.

Thankyou
Vidya

Merge Join conditional split - HeadBanger

I have a merge join that does a full outer join. I then have a conditional split that will breakout by unchanged, insert and update. The update is what I am having a problem with. The conditional split for the update is a follows:

!ISNULL(HISTORICAL) && !ISNULL(TRANSFORM)

Now I believe the problem is related to spaces in the key field let me explain.

The join field is defined as [char](14) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

There was an original mismatch between the two tables of VARCHAR(14) & Char(14) but now there both Char(14).

What's interesting is that the few records with Alpha characters correctly does the update but the records with starting numeric data all go down the insert path.

The data of the join columns look like

'308811151 ' - 5 Spaces This Data incorrectly goes to insert path

'TSTRWR02 ' 6 Space This Data correctly goes to update

The data is grouped and sorted by the Key field and the historical and transform column contain a 1 for all records.
while the Doc_nbr which is the key


I tested the join in SQL server and it works with and without the spaces.

From SSIS I also tried RTRIM in the SQL command of the source in addition to substring the first 9 with the right trim and I always received the same results.

I am guessing there is something going on with the not isnull in the conditional split but I can't figure it out if I am missing something or if this is a BUG.


Three paths of data after the merge join and the conditional split

Insert
DOC_NBR,HISTORICAL,TRANSFORM,DOC_NBR_historical
206352185 ,,1,
209900165 ,,1,


Update
DOC_NBR,HISTORICAL,TRANSFORM,DOC_NBR_historical
NEE ,1,1,NEE
New ,1,1,New


Unchanged
DOC_NBR,HISTORICAL,TRANSFORM,DOC_NBR_historical
,1,,0000521096
,1,,0000685216

Any suggestions, am I missing the obvious?

LL


I beleive this to be a bug relating to a combination of the data relating to a char 14 which gets populated with space to the right on short columns in conjunction with the issorted property and the sortkey position of the oledb data source for sql server with the access mode SQL Command.

A work around on this was to remove the order by in the sql statements and to add a sort component before each input to the merge join.

So if anyone receives the same behavior, replace with sort components at least until you can find another solution like the look-up mentioned on this forum. You will take a little bit of a performance hit but at least you will get the anticipated results.

On a lighter note, Currently It's 78 degrees here in Miami and I didn't need a web service to figure that out.

LL

sql

Merge Join = 99%

I have got a query in which a merge join is 99% of the cost ... and I am confused ... is not merge join supposed to be the fastest ? Anyone seen this before ?
Any ideas why this could be happening ... and sorry ... do not ask me to post the code coz I will not be able to ...just because it is 99% of the cost that does not indicate that it was excessive. it just means that out of 100% of total cost required to run the query, (parsing resolving compiling (serial and parallel plans) and execution)
that the execution was 99% of the total operation. this indicates that figuring it out isnt taking a long time but doing it is.

now I believe that when you say 99% of the cost you mean in the Graphical execution plan, then that is still justifiable considering the amount of work that sql has to do to perform a dual sorted join.

sometimes if a merge join is warranted, sql server will sort unsorted columns because the optimizer feels that the merge join is more beneficial to the query than the cost of not performing a merge join.

if one of the columns involved is a foriegn key and it is not indexed you might want to consider a nonclustered index on this column.
this will presort the data in the column and speed up the merge.
the Graphic EX will probably still indicate the 99% (of the process time)but your IO and or execution time might decrease.|||There is a non clustered index on the columns ... but the optimizer is choosing the clustered index ... and the merge join is taking 5 minutes ...|||Is the optimizer scanning the clustered index, or doing a seek? Also, how is the where clause on this pig? If you can eliminate "OR"s or "(NOT) IN"s, you will probably do better.|||The query is on a big fat b#$%%rd table of about 11 million rows ...

Should i take the table with all its where clauses into another temp table and then do a join ... currently I am using a derived table ...|||How about the DDL and the DML...skip the sample data...

Don't forget to script the indexes|||Just read up on "Understanding Merge Joins" in BOL. Sounds like in your case, the data is not sorted on the merge keys going into the query. Also, the fact that you are dealing with millions of rows. I have only seen the merge join a few times, but every time has been with 1 million rows or more.

Just for grins (because I doubt it will work), what is the result of throwing an order by on the subquery? The order by should be the same order as the join, and ideally there should be an index on the table you are joining to with that same order.

And this is just the thing I would need to have someone else look at myself, because you never know what is going to jump out at a different pair of eyes.|||I would avoid the ORDER BY

It's more overhead...until you get the final result (which is how big again?)

It's either going to use the index or it's not.

Where's the driver data coming from, since your doing a merge...

And it will have to scan the driver....|||Another pair of eyes .. i am not gonna get ... am working at client's place all alone on my laptop ... away from my office ... and as I said before in the YAK Corral... that Shark tank story looks a lot like mine ...|||Another pair of eyes .. i am not gonna get

Dude we are your eyes...

Can't you post it?|||Just my luck ... am under a strict NDA|||So am I...just under a completely unrelated project ;-).

As for the sort in the subquery, it is just a guess based on the description of how a merge jin works. It may not work, but it sounds like this is getting down to the "Hail Mary" point.|||Dude we are your eyes...

Anyway ... Thanks for the nice words ...

What do you think would give better performance under the conditions ...
a) Temp table
b) Table variable
c) Derived table

The table contains data for an year ... the selection is for a month ... and then have to make a join on a different col from another big fat table containing about 80,000 records ...|||It's ALL about the indexes...

Get the result set as small as possible...

how many rows in a month of data..if it's not too big you could use a table datatype...

I still like derived tables

In the predicate...is there an index for every column..are they in the right order?

How many columns in the orw?|||How many rows in option 1, 2, or 3?|||No of rows in a month ... 11 million divided by 12 ...
And I too like derived tables ...

not many colums in the row ... if I remember correctly .. 8 or 9 ...

Ok .. going to sleep now ... will update how it goes tomorrow ...|||avoid the order by as much as possible it wont affect the sort order of the table in the case of the join.

i would consider that the merge is going to occur in this case regardless based on the sql optimizer

i would try a join hint (in development only) and initially i would try to set forceplan on. just to see what the io is on your current query as written.
they i would play with the join order just to see how the optimizer chooses it's plan.
oh yeah
dont forget to update stats and check your indicies etc... before you start using optimizer hints.|||avoid the order by

What did I say....

so it's 1 million...not huge, but still alot...

Is the date indexed?

With 8-10 columns the best you'll have would be index seek, bookmark lookup...

Did you SHOW PLAN yet?

Merge Join (Full outer join) never finishes.

I have a merge join (full outer join) task in a data flow. The left input comes from a flat file source and then a script transformation which does some custom grouping. The right input comes from an oledb source. The script transformation output is asynchronous (SynchronousInputID=0). The left input has many more rows (200,000+) than the right input (2,500). I run it from VS 2005 by right-click/execute on the data flow task. The merge join remains yellow and the task never finishes. I do see a row count above the flat file destination that reaches a certain number and seems to get stuck there. When I test with a smaller file on the left it works OK. Any suggestions?

Try this - replace your destination with a RowCount component - it does nothing except count rows into a variable. if the package succeeds, then you have narrowed the problem down to an issue with your destination.

If there is an issue with your destination, you should start to look into disk and memory usage.

hth - I'll look forward to hearing more.

Donald

|||The way I worked around the problem was to separate it into 2 dataflows, the first one just taking the left branch of the input to the merge join and landing it to a raw destination. Then in the second data flow (the merge join) I simply used the raw file as the left input and it worked fine. Does that eliminate the possibility of it being an issue with the destination? If not then I will recreate the original scenario and replace the destination with a RowCount component as you suggested. Thanks.|||

That's a tricky one. I wonder if the combined memory usage of your script component and the merge join was stressing the box.

Donald

Merge Join - HELP

Hi,

I have a SQL Statatment:

SELECT * FROM TABLE1 AS A
JOIN TABLE2 AS B
ON A.X= B.X
AND A.Y= B.Y

When i execute this code in sql server returns 549 lines. I created a package with two oledb sources one for each table, sorted the tables with fields X and Y after placed a Merge Join with the fields:

A.Y join B.Y order 1

A.X join B.X order 2

both fields with the Join Key checked

But my package return 411 lines.

What's happened? :(

When a i have the code:

SELECT A.X, A.Y, B.X, B.Y
FROM TABLE1 AS A
JOIN TABLE2 AS B
ON A.X= B.X

When i did the join only one field SSIS worked fine, sql server returns 622 and SSIS returns 622 lines.

Please help-me...

Thanks,

Andr

Are your join keys all integers or does one of them have characters. SSIS is case senstive while SQL Server by default isn't. You may want to check this.|||

Also check that the inputs of the Merge Join are sorted...

Rafael Salas

|||

Thanks Jay,

My columns are characters type and my oledb source 1 columns is lowercase and oledb source 2 columns is uppercase.

I created a Derived Column replaced my column to uppercase. :)

|||Character Map Transform might be more appropriate to change case|||

Rafael Salas wrote:

Also check that the inputs of the Merge Join are sorted...

Rafael Salas

Very good point. And note that setting IsSorted=1 on an output does NOT sort the data.

-Jamie