Friday, March 30, 2012
Merge Large Tables
What is the best way to merge two really large tables (7,000,000 rows each,
and very wide)?
The table definitions are the same for both tables.
Right now I am using an INSERT INTO statement with a selection from one table
at a time, but it takes way too long. I don't need it to be logged. An
all-or-nothing result is fine for me. I don't think DTS is an option because I
need to run this from a C# app. I think my only other option is using the
BCP API to select the data and load it into the new table, but this just
seems like the wrong way to go.
Any other ways to go with this?
Shawn
Select *
into [NewTable]
From (
    Select * From [Table1]
    Union All
    Select * From [Table2]
) As Merged
"Shawn Meyer" <me@.me.com> wrote in message
news:O4GhS6QQFHA.580@.TK2MSFTNGP15.phx.gbl...
> What is the best way to merge two really large tables (7,000,000 rows
> each,
> and very wide)?
> The table definitions are the same for both tables.
> Right now I am using a insert into statement with a selection from one
> table
> at a time, but it takes way too long. I don't need it to be logged. A all
> or nothing result is fine for me. I don't think DTS is an option because
> I
> need to run this from a C# app. I think my only other option is using the
> BCP api to select the data and load it into the new table, but this just
> seems like the wrong way to go.
> Any other ways to go with this?
> Shawn
>
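The SELECT ... INTO with UNION ALL suggested in the reply can be tried out on a small scale first. The following is only a sketch, assuming an in-memory SQLite database as a stand-in for the real server; the two-column schema and the sample rows are made up for illustration, and SQLite's CREATE TABLE ... AS SELECT is used because SQLite has no SELECT ... INTO.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER, payload TEXT);
    CREATE TABLE Table2 (id INTEGER, payload TEXT);
    INSERT INTO Table1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO Table2 VALUES (2, 'b'), (3, 'c');
""")

# CREATE TABLE ... AS SELECT plays the role of SELECT ... INTO here.
# UNION ALL keeps every row from both inputs, duplicates included.
conn.execute("""
    CREATE TABLE NewTable AS
    SELECT * FROM Table1
    UNION ALL
    SELECT * FROM Table2
""")

merged_rows = conn.execute(
    "SELECT id, payload FROM NewTable ORDER BY id, payload"
).fetchall()
print(merged_rows)
```

UNION ALL is the right operator when the goal is a plain merge: unlike UNION, it does not sort or de-duplicate the combined rows, so the row that exists in both inputs shows up twice in the result.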
Merge join: nr of output rows unchanged when amount of input changes
Dear all,
I created a package that seems to work fine with a small amount of data. However, when I run the package with more data (as in production), the merge join output is limited to 9963 rows, no matter how I change the number of input rows.
Situation as follows.
The package has 2 OLE DB Sources, in which SQL-statements have been defined in order to retrieve the data.
The flow of source 1 is: retrieving source data -> trimming (non-key) columns -> sorting on the key-columns.
The flow of source 2 is: retrieving source data -> deriving 2 new columns -> aggregating the data to the level of source 1 -> sorting on the key columns.
Then both flows are merged and other steps are performed.
If I test with just a couple of rows it works fine. But when I change the where-clause in the data source retrieval, so that the number of rows is for instance 15000 or 150000 the number of rows after the merge join is 9963.
When I run the package in debug-mode the step is colored green, nevertheless an error is displayed:
Error: 0xC0047022 at Data Flow Task, DTS.Pipeline: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "Merge Join" (4703) failed with error code 0xC0047020. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.
To be honest, a few more error messages appear, but they don't seem related to this issue. The package stops running after some 6000 rows have been written to the destination.
Any help will be greatly appreciated.
Kind regards,
Albert.
If you could post the full error output here, that would probably be helpful. Sometimes it is the "big view" that helps point you at the cause of the problem, especially since many of SSIS's error messages are not particularly transparent. The earlier errors often show what triggered the later errors, even if they do not appear directly related.
|||Can you also include details of the next task in the pipeline, the one that accepts the ~9000 rows? It sounds like it's failing on the first buffer it receives as input.
|||OK, for the big picture: I feel like a fool.
Solving one of the other errors solved the reported issue as well.
Apologies for bothering you.
Merge Join vs. Lookup vs. Custom Script - which is fastest?
Take your source and throw it through each of the above options and finally into a row counter. Compare the time it takes to get through the whole dataflow.|||
TheViewMaster wrote:
Very often we have 50'000+ rows which you need to pull values from different source (e.g. CityNames from citycode in Excel file). Currently we are using Lookup - but the questions is which of those 3 options is best in performance wise?
Only you can answer that question. Test and measure, test and measure, test and measure.
-Jamie
|||Thanks guys for your answers - I will try it out performance testing this weekend on my free time.
So far it has seemed to me that merge join is slower than lookup; however, lookup seems to take much longer than I'd like, so I was wondering if creating a script transform would be a better solution... Just wanted to get an idea: based on your experience, which option do you use?|||
TheViewMaster wrote:
Thanks guys for your answers - I will try it out performance testing this weekend on my free time.
So far it has seemed to me that merge join is slower than lookup; however, lookup seems to take much longer than I'd like, so I was wondering if creating a script transform would be a better solution... Just wanted to get an idea: based on your experience, which option do you use?
OK. Well, I am loath to give my opinions on performance comparisons, but I'd lay a lot of money to say that script transform will be slowest.
-Jamie
|||
If you do test the three methods, please post the results here. I am using custom script for lookups (small reference lists but millions of source rows in pipeline) but I would like to know how large reference lists perform.
|||For those posting to this thread and reading it, please watch the Webcast presented by Donald Farmer on performance and scale in SSIS. In there Donald talks about benchmarking and how to set up SSIS to obtain timings associated with different aspects of a package.
TechNet Webcast: SQL Server 2005 Integration Services: Performance and Scale (Level 400)
MS TechNet Event ID: 1032298087
I don't know if this link will work for anyone:
https://msevents.microsoft.com/CUI/Register.aspx?culture=en-US&EventID=1032298087&CountryCode=US&IsRedirect=false|||Where can I report a BUG about this forum - I have a 50/50 chance that when I try to create a hyperlink in my post - the Firefox crashes.
(Thank god I copied and pasted the following post to notepad before "doing the hyperlink trick")|||So here we go:
I'm running the tests on my workstation WinXP, 2.93GHz, 2.5gb ram.
The DB is accessed over the LAN.
Test1 (Lookup):
Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
Lookup is a query access table 248250 records pulling 61280 records and about 25 columns
2 outputs - Listing Found (56523 rows) and Error Listing Not found (118990 rows)
Also lookup is Full Cache mode and gives Warning: found duplicate key values.
Result:
Finished, 4:11:00 PM, Elapsed time: 00:00:15.437
Note: Memory usage of PC peaked at 1.8GB with CPU usage jumping to 100% once.
Test 2 (Merge Join):
1st Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
2nd source: OLE DB Source with query access table 248250 records pulling 61280 records and about 25 columns with ORDER BY ID. Out put is marked sorted by ID column.
1st source is Sorted using "Sort transform".
Then "Merge Joined" with ole db via Left outer join (Sort on left)
Then "Conditional Split" based on ISNULL(oledbsource.ID)
Result:
Finished, 4:49:33 PM, Elapsed time: 00:01:14.235
Note: Memory usage of PC peaked at 2.6GB with CPU usage jumping to 100% twice.
Test3 (Script Transform) -
Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
Script transform to do a lookup based on key column for each row in pipeline.
Result:
Cancelled after 30 minutes of processing - during which it had processed 11547 records (out of 175513)
Note: Memory usage was stable around 1GB and CPU near 5% usage
My Conclusion:
Although I was concerned with the performance of the lookup transform (for testing whether data is to be inserted or updated), it seems that's not the culprit. The root of evil seems to be the OLE DB update command and the OLE DB Destination (at the moment we are using a SQL 2000 DB; upgrading to 2005 soon).
Although the script transform consumed the least machine resources, executing 100K+ SQL queries against the DB will take too long.
Although the merge join elapsed time is not bad, its resource usage and its 3 more steps than lookup are negatives.
So I think next weekend's performance testing is how to make faster INSERTs/UPDATEs to the DB.
Test 1 & 2 are based on Jamie Thomson article - http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx
Test 3 is based on Greg Van Mullem article - http://www.mathgv.com/sql2005docs/SSISTransformScriptETL.htm|||
Excellent stuff. This is really valuable information. Thank you. I've updated my post with a link to here.
|||Yes, thanks for posting; very interesting info. Today I am going to change all my script lookups to use the StringBuilder class and its methods (strongly recommended in all the .NET literature where performance is important when modifying strings). Currently all my lookup script transforms use object-based .NET string variables, which are notoriously terrible performers when the string values are repeatedly modified. Do you know which approach your script transform used? (Assuming you are creating and modifying string variables in your lookup script)...
If I detect the same low processor usage in my script lookups I may also try and partition the pipeline to get a lookup to run with multiple threads...
Ken
|||My script does a lookup similar to the one described in the aforementioned Van Mullem article:
Public Overrides Sub PreExecute()
    ' Prepare the parameterized lookup command once, before any rows arrive.
    sqlCmd = New SqlCommand("SELECT KeyCustomer, CustomerName FROM tblCustomer WHERE (KeyCustomer = @KeyCustomer)", sqlConn)
    sqlParam = New SqlParameter("@KeyCustomer", SqlDbType.Int)
    sqlCmd.Parameters.Add(sqlParam)
End Sub
Public Overrides Sub CustomerRecordsInput_ProcessInputRow(ByVal Row As CustomerRecordsInputBuffer)
    Dim reader As SqlDataReader
    ' One round trip to the database for every row in the pipeline.
    sqlCmd.Parameters("@KeyCustomer").Value = Row.CUNO
    reader = sqlCmd.ExecuteReader()
    If reader.Read() Then
        Row.DirectRowToUpdateRecordsOutput()
    Else
        Row.DirectRowToInsertRecordsOutput()
    End If
    reader.Close()
End Sub
|||Ken - is your script performing a lookup from another source in the pipeline? (Boy, I'd like to know how to do that.)
Also - any suggestions on how to improve the performance of the OLE DB Update command?|||
Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.
A non-cached lookup will be extremely slow, as was your script component.
|||
Crispin wrote:
Do a fair comparison though. Either change your query to cache the rows from SQL or disable caching on the lookup. Oranges != Apples.
A non-cached lookup will be extremely slow, as was your script component.
It would be best to try to replicate full caching in the script component. The purpose of the exercise was to see which was faster. So, we know how fast (and legitimately so) the lookup component was, now how fast can we get the script component to process?
The question is how fast can each of the elements process their data, not how slow can we make them work.
Phil
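Phil's suggestion of replicating full caching in the script component comes down to reading the reference keys once and then testing each pipeline row against memory instead of the database. The following is only a sketch of that idea in Python rather than the thread's VB.NET script: tblCustomer, KeyCustomer and CUNO are taken from the script posted above, while the helper functions, the in-memory SQLite database and the sample rows are assumptions made for illustration.

```python
import sqlite3

def build_cache(conn):
    """Load every reference key once, mimicking the Lookup's full-cache mode."""
    return {row[0] for row in conn.execute("SELECT KeyCustomer FROM tblCustomer")}

def route_rows(source_rows, cache):
    """Split source rows into updates (key found) and inserts (key missing)."""
    updates, inserts = [], []
    for row in source_rows:
        # Membership test against the in-memory set: no query per row.
        (updates if row["CUNO"] in cache else inserts).append(row)
    return updates, inserts

# Hypothetical reference table and source rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblCustomer (KeyCustomer INTEGER PRIMARY KEY, CustomerName TEXT);
    INSERT INTO tblCustomer VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
""")
source_rows = [{"CUNO": 1}, {"CUNO": 7}, {"CUNO": 3}]

cache = build_cache(conn)  # one query in total
updates, inserts = route_rows(source_rows, cache)
print([r["CUNO"] for r in updates], [r["CUNO"] for r in inserts])
```

The trade-off is the same one the Lookup makes in full-cache mode: memory for the whole key set up front in exchange for avoiding a round trip per row.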
Wednesday, March 28, 2012
Merge apparent duplicate rows into 1 row?
I have a select query that can generate apparent duplicates. This
occurs because the Histology value is determined from a table,
tblSample, which may contain a number of samples for the same location:
there are different methods of obtaining samples, and sometimes 2 or more
methods are used to back up results. The method is not important for
this table and so is not shown, which produces the apparent duplicates. Here's an
example of the table:
Code   Date        Location      Histology
-----  ----------  ------------  ---------
CO123  12/08/2005  Left Main     Adeno
CO123  12/08/2005  Left Main     Adeno
BJ234  12/08/2005  Right Main    Normal
BJ234  12/08/2005  Right Lower   Squamous
CH345  17/08/2005  Right Middle  Normal
This is my SQL:
SELECT tblPatient.pntCode AS Code, tblPDT.pdtDate AS Date,
tblLesion.lesLocation AS Location,
tblSample.splHistology AS Histology
FROM tblPatient, tblPDT, tblLesion, tblSample
WHERE tblPatient.patientNo = tblPDT.patientNo
AND tblPatient.patientNo = tblLesion.patientNo
AND tblLesion.lesNo = tblSample.lesNo
Is there a way to combine these apparent duplicate rows into one row?
Essentially doing:
If no of rows where Code, Date, Location, match > 1
Delete rows >= 2
Thanks
|||Use a SELECT DISTINCT:
SELECT DISTINCT
tblPatient.pntCode AS Code, tblPDT.pdtDate AS Date,
tblLesion.lesLocation AS Location,
tblSample.splHistology AS Histology
FROM tblPatient, tblPDT, tblLesion, tblSample
WHERE tblPatient.patientNo = tblPDT.patientNo
AND tblPatient.patientNo = tblLesion.patientNo
AND tblLesion.lesNo = tblSample.lesNo
Tom
----
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Columnist, SQL Server Professional
Toronto, ON Canada
www.pinpub.com
.
"Assimalyst" <c_oxtoby@.hotmail.com> wrote in message
news:1125055041.994620.74900@.g14g2000cwa.googlegroups.com...
Hi,
I have a select query that can generate apparent duplicates; this
occurs because the Histology value is determined from a table
tblSample, this may contain a number of samples for the same location
as there are different methods of obtaining samples sometimes 2 or more
method are used to back up results. The method is not important for
this table and so not shown, so showing apparent duplicates. Heres an
example of the table:
Code Date Location Histology
---
CO123 12/08/2005 Left Main Adeno
CO123 12/08/2005 Left Main Adeno
BJ234 12/08/2005 Right Main Normal
BJ234 12/08/2005 Right Lower Squamous
CH345 17/08/2005 Right Middle Normal
This is my SQL:
SELECT tblPatient.pntCode AS Code, tblPDT.pdtDate AS Date,
tblLesion.lesLocation AS Location,
tblSample.splHistology AS Histology
FROM tblPatient, tblPDT, tblLesion, tblSample
WHERE tblPatient.patientNo = tblPDT.patientNo
AND tblPatient.patientNo = tblLesion.patientNo
AND tblLesion.lesNo = tblSample.lesNo
Is there a way to combine these apparent duplicate rows into one row?
Essentially doing:
If no of rows where Code, Date, Location, match > 1
Delete rows >= 2
Thanks
|||Thanks Tom, I'm pretty new to SQL so not very familiar with the syntax
yet. Glad this one was an easy fix!
|||If you're new to SQL, this is the place to hang out. :-)
Tom
----
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Columnist, SQL Server Professional
Toronto, ON Canada
www.pinpub.com
.
"Assimalyst" <c_oxtoby@.hotmail.com> wrote in message
news:1125058223.511501.191000@.o13g2000cwo.googlegroups.com...
Thanks Tom, i'm pretty new to SQL so not very familiar with the syntax
yet. Glad this one was an easy fix!
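The effect of the SELECT DISTINCT fix in this thread can be checked against the sample rows from the question. This is only a sketch, assuming an in-memory SQLite database and a single flattened table (here called joined, a hypothetical name, with the date column renamed PdtDate) in place of the four-table join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE joined (Code TEXT, PdtDate TEXT, Location TEXT, Histology TEXT)"
)
# The five rows from the question, including the repeated CO123 row.
conn.executemany(
    "INSERT INTO joined VALUES (?, ?, ?, ?)",
    [
        ("CO123", "12/08/2005", "Left Main", "Adeno"),
        ("CO123", "12/08/2005", "Left Main", "Adeno"),
        ("BJ234", "12/08/2005", "Right Main", "Normal"),
        ("BJ234", "12/08/2005", "Right Lower", "Squamous"),
        ("CH345", "17/08/2005", "Right Middle", "Normal"),
    ],
)

all_rows = conn.execute(
    "SELECT Code, PdtDate, Location, Histology FROM joined"
).fetchall()
distinct_rows = conn.execute(
    "SELECT DISTINCT Code, PdtDate, Location, Histology FROM joined"
).fetchall()
print(len(all_rows), len(distinct_rows))
```

DISTINCT compares every selected column, so it only collapses rows that match in all four. Two samples from the same location with different Histology values would still come back as two rows.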
Monday, March 26, 2012
Merge always deletes inserts on subscriber
Table1 replicates to Table3.
Table2 is just another table with the same schema as Table1 and Table3.
In my setup, I created a copy table, copied 66 rows from Table1 on the publisher
to Table2 on the subscriber, and deleted all 66 rows. The changes were replicated
to the subscriber, to Table2 (66 deletes made). Now I insert all 66 rows from
Table2 into Table3 on the subscriber side. The change should have gone to the
publisher. Instead, the inserts get deleted on the subscriber... Why would
SQL do this?
The inserts done later should be taken as new changes on the subscriber and
should be replicated to the publisher, shouldn't they?
Forgot to add: it treated this as a conflict between pub and sub, and for me, by
default, the pub wins... The problem with this, though, is that it shouldn't be a
conflict. I've ADDED these rows right now and ONLY to the subscriber...
Any clues?
Friday, March 23, 2012
Merge 2 columns from the same table
I have a table with 2 columns A1 and Firm. A1 holds "dirty data" and there are no NULL values. Firm holds "clean data" for some of the rows in A1 but not for all. So there are quite a few NULL values in this column. I want to replace the value in A1 if there is any data in Firm (value <> NULL).
Right now, I solve this issue with a simple VB.NET script:
Public Class ScriptMain
    Inherits UserComponent
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Use the clean Firm value when present; otherwise fall back to A1.
        If Row.Firm_IsNull Then
            Row.Standard = Row.A1
        Else
            Row.Standard = Row.Firm
        End If
    End Sub
End Class
Any ideas how to solve this issue without custom code?
Thanks!
It sounds to me like you can use a Derived Column transformation:
Replace A1:
With an expression like
ISNULL(Firm) ? A1 : Firm
Rafael Salas
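The Derived Column expression above is a plain "use Firm unless it is NULL" rule. As a rough illustration only, the same rule looks like this in Python, with None standing in for NULL; the standardize helper and the sample rows are hypothetical.

```python
def standardize(a1, firm):
    """Return firm when present, otherwise fall back to a1 (None stands in for NULL)."""
    return a1 if firm is None else firm

# Hypothetical rows: A1 is never NULL, Firm is NULL for some rows.
rows = [
    {"A1": "acme corp.", "Firm": "Acme Corporation"},
    {"A1": "Widgets Ltd", "Firm": None},
]
standard = [standardize(r["A1"], r["Firm"]) for r in rows]
print(standard)
```

This mirrors both the VB.NET script and the ISNULL(Firm) ? A1 : Firm expression: the dirty A1 value survives only where no clean Firm value exists.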
Monday, March 12, 2012
Memory usage
I need to know which of the following two methods needs less RAM.
There are 2 big tables, each about 9 M rows, and 6 small dimension tables with about 10 to 100 rows each. The dimension tables are joined by their IDs with one of the big tables.
The structure of a dimension table looks like
CarID (tinyint)  Description (varchar(20))
1                BMW
2                Porsche
I want to join the 2 big tables in a materialized view. Later I will run queries like
select * into #temp from dbo.vw_materialized_view where Car = 'BMW'
So, back to my question: will such a query take less memory (RAM) if I join all 8 tables when I create the materialized view, or if I only join the 2 big tables in the materialized view and later join the view with the 6 dimension tables?
Hope you got that ;-)
Thank you
|||Memory usage will be managed by SQL Server.
If you create an index on a view, then that data will be stored just like a base table, so you incur more overhead and disk storage, or if it's small enough, in memory.
But all of that is managed by SQL Server.
And if you don't index the view and the joins are simple enough, then it'll use the indexes on the table.
What was the question again?
Memory Usage
"When I get a big result by a query (such as 1M of rows), the memory usage
of my system decreasing critically. What is the phenomenon that causes this
behaviour?"
OK, that is expected, because I can still observe the rows in the result pane
below. But even after I close the pane, the memory usage is still high.
I think it occurs because the result is cached in order to respond to later
requests faster.
Can you please comment on this?
Also, I would like to learn if there is a way to release/flush this memory from
the Query Analyzer.
|||That happens by design. It's not a flaw or a memory leak.
It is not really Query Analyzer that is "eating the memory", but
SQL Server, which is behaving according to the memory settings
established in Enterprise Manager.
Open EM, right-click the server registration you wish to check and
click Properties. Move to the Memory tab. It is recommended that the
server be configured to dynamically allocate memory.
SQL Server will use as much memory as it can. As other processes
start requesting memory, it will release it. Until then, it will keep
its own memory usage. With queries being run in Query Analyzer, the
memory usage will keep increasing until just short of having to start paging.
I know of no way to programmatically release this memory. However, I
don't think this is an issue, since any other process that needs memory
will have it released by SQL Server.
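If the concern is that SQL Server grows into memory that other applications need, one option is to cap it with the max server memory setting, which, as far as I know, is the same maximum that the Memory tab exposes. A sketch only; the 2048 MB value is an arbitrary example, not a recommendation:

```sql
-- 'max server memory (MB)' is an advanced option, so advanced options
-- must be made visible first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap SQL Server's memory at 2048 MB (example value only).
EXEC sp_configure 'max server memory (MB)', 2048;
RECONFIGURE;
```

This limits how far the memory usage can grow; it is not a command that flushes memory on demand.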
Wednesday, March 7, 2012
Memory Problem
Hi,
I need help. I'm working with 30,000,000 rows with the Association Rules algorithm, but when I process the model, it fails with a memory problem.
PLEASE help me
Charlie
Hello, Charlie
Here are a few suggestions that would reduce the memory pressure on the server:
- Increase the value of the MINIMUM_SUPPORT parameter. The default (0.03) value requires a group of items to appear in at least 3% of the transactions to be considered for the left hand side of a rule. By increasing it, the algorithm will analyze fewer items.
- Decrease the MAXIMUM_ITEMSET_COUNT parameter.
- Decrease the value of MAXIMUM_ITEMSET_SIZE or increase the value of MINIMUM_ITEMSET_SIZE. If shorter rules correlating two or three items are OK, then decrease the MAX... parameter (the default value is 3, which generates rules with up to 3 items on the left hand side).
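Before raising MINIMUM_SUPPORT, it can help to see how many items would survive a given threshold. A rough T-SQL sketch, assuming a hypothetical table dbo.BasketLines with one row per item per transaction (OrderID, Product); the real case and nested tables will differ:

```sql
-- dbo.BasketLines(OrderID, Product) is a hypothetical table, for illustration.
-- Support of an item = share of transactions that contain it.
DECLARE @TotalOrders float;
SELECT @TotalOrders = COUNT(DISTINCT OrderID) FROM dbo.BasketLines;

SELECT Product,
       COUNT(DISTINCT OrderID) AS Orders,
       COUNT(DISTINCT OrderID) / @TotalOrders AS Support
FROM dbo.BasketLines
GROUP BY Product
HAVING COUNT(DISTINCT OrderID) / @TotalOrders >= 0.03   -- the default mentioned above
ORDER BY Support DESC;
```

Re-running it with a higher threshold in the HAVING clause shows roughly how many single items the algorithm would still have to consider.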
Hope this helps
|||
Can you recommend a good source for reading up on tuning Analysis Services and in particular the mining model viewer? Fixing problems is one thing, but I'd also like to build up a good foundation of knowledge.
Thanks,
Nick
|||Hi Bogdan, thank you!
I reduced the parameters, but this model is important to a supermarket, and they don't want to reduce some of the parameters for business reasons.
I was reading about this: is it possible that the problem is 32-bit support? The server has 4 processors and 16 GB of RAM.
Any suggestions, please.
HELP ME!!
Charlie
|||Carlos, can you post the exact error message that you receive? That may help us to narrow down the issue.
Thanks
|||Are you running the 64 bit or 32 bit version of Analysis Services? With the 32 bit version, the server cannot take advantage of the 16 GB of RAM.
|||Hi,
The error is: Server Timeout. The process has been canceled.
Umm, is there a limit on the number of records on a 32-bit platform?
The process runs for 3 hours and starts reading the cases; it fails when it reads the last cases.
Thank you for the help!
Charlie
|||
cbdl10 wrote:
Hi,
I need help. I'm working with 30,000,000 rows with the Association Rules algorithm, but when I process the model, it fails with a memory problem.
PLEASE help me
Charlie
Hi,
I've got the same problem as you.
I haven't found a solution yet. Did you?
Is it correct to set MAXIMUM_ITEMSET_COUNT equal to the count(*) of my fact table?
[If I reduce my fact table, that is, if from a fact table of 180,000 rows I take only 9,000 rows at random, the algorithm still takes 25 minutes to finish.]
|||
cbdl10 wrote:
Hi,
I need help. I'm working with 30,000,000 rows with the Association Rules algorithm, but when I process the model, it fails with a memory problem.
Hi there,
I've got the same problem as you. I'm working with a fact table of 180,000 records and I have a memory problem too (32-bit, 2 GB RAM).
I tried sampling my fact table, using only 9,000 randomly chosen records, but it still takes about 25 minutes. So I decided to try tuning the algorithm parameters:
MAXIMUM_ITEMSET_COUNT: 9,000;
MAXIMUM_ITEMSET_SIZE: 2;
MINIMUM_SUPPORT = 0.1;
It executes very fast, but it doesn't show any results. Does a tuning how-to exist? Did you solve your problem?
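One possible reason for the empty result, offered as a guess: if the 9,000 sampled records correspond to 9,000 cases, MINIMUM_SUPPORT = 0.1 means an itemset has to appear in at least 0.1 × 9,000 = 900 cases, and with MAXIMUM_ITEMSET_SIZE = 2 only pairs of items can form a rule. A rough check against the source data, assuming a hypothetical table dbo.BasketLines with one row per item per transaction (OrderID, Product):

```sql
-- dbo.BasketLines(OrderID, Product) is a hypothetical table, for illustration.
-- Lists item pairs that occur together in at least 10% of the transactions.
DECLARE @MinCases float;
SELECT @MinCases = 0.1 * COUNT(DISTINCT OrderID) FROM dbo.BasketLines;

SELECT a.Product AS ProductA,
       b.Product AS ProductB,
       COUNT(DISTINCT a.OrderID) AS Orders
FROM dbo.BasketLines AS a
JOIN dbo.BasketLines AS b
  ON b.OrderID = a.OrderID
 AND b.Product > a.Product      -- count each unordered pair once
GROUP BY a.Product, b.Product
HAVING COUNT(DISTINCT a.OrderID) >= @MinCases;
```

If this returns no rows, no pair reaches the threshold, and a lower MINIMUM_SUPPORT would be needed before any rules can appear.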