Top 10 tips for data integration with pentaho kettle : tips #1-#3
Tip #1: keep it simple – divide and conquer
There are developers that tend to complicate stuff.
Pentaho kettle will not save you from yourself, nor any other ETL tool
you need to figure out how to keep the transformation simple and readable for your own sake.
If you’re not going to control your creativity while “playing with steps” you can find yourself with a spaghetti.
Now, it could work, but in case it doesn’t “you have a problem-Houston”
Good example for a bad example:
Now think what really happens if you have an error somewhere,
I don’t know if you have real experience hands on with pentaho kettle but you soon find out
that sometimes the logging are not enough to understand the problem and how to solve it.
My suggestion is: divide and conquer
you have many steps in kettle that can move data from one transformation to another,
also, sometimes you can just load it to a table and then extract again on the next transformation.
Yeah, I understand, it sounds like over doing and unoptimized. But is it a matter of the specifics scenario.
So if you have one transformation that loads 20,000 rows,
it’s not a problem to load them and then on the next transformation to extractfrom the table again.
You can also use the “copy rows results” step and then to use it on your next transformation with “get rows from results”.
Another option is to use a temporary file. Kettle writes and reads very fast from files.
Again it’s a matter of resources and database size.
You’ll need to act differently if you have 10 billion rows.
When dividing the large transformation into smaller transformations you can control
where the problem/error occur and to pinpoint the problem thus find the solution a lot quicker and easy.
On the other hand making the transformation too short is also a problem,
let’s say you have a MySQL database and at your first step use
“table input” and on the next step the use “table output”
Your SQL statement looks like that:
In this situation you have only two steps, but you actually didn’t use pentaho
kettle the right way, that is because you did everything in the SQL statement.
So if you have an error and you need to find it. You will need to dissect the SQL.
Examples:
a. change names of fields
b. convert data types
c. union
My suggestion:
use the table input in order to bring raw data, the only exceptions are:
a. joins
b. order by
Those two are better to do in the source,
meaning: do not bring all the rows to pentaho kettle and then “join” or “order by” if you don’t have to.
Tip #2: do not over optimize:
We, IT guys and girls engineers, feel that if it’s not perfect we failed.
So we spend a lot of the time and money (usually customers money) in order to make it perfect and flawless
.
So for example: (from a real customer of course)
Two engineers spent 4 days to optimize a job in order to improve its execution time from 5 minutes to 3 minutes.
they were so happy when they succeeded. But what was the point?
It doesn’t have to be perfect, it needs to work well and stable in the time frame the customer defines.
So it is a different story if the customer needs the data to be updated every 4 min. and it runs for 5 minutes.
In our case, the customer wanted it to be updated only once a night and it ran for 30 minutes.
So if you like to fix stuff that are not broken do it on your own time and money.
We want the transformation and jobs to be optimal,
a little secret:
you can always make your jobs and transformations faster.
It’s a matter of the time you need to spend and the importance of the optimization.
Don’t worry, you will find out that the customer invents crazy transformations and jobs all on his own.
You don’t need to invent complexity, it’s there all the time anyway.
Tip #3: detailed design, baby.
Always analyze the processes that need to be computerized.
The customer doesn’t know what you really need, when you ask the right questions you’ll find out the truth about the project.
You don’t need to overdo it of course, but just enough to get the full picture and when you do – write it down and let the customer sign it.
This is the base of your work, your project, your agreement with the customer.
You will find out that the customer “forget” what was the initial analysis and the project changes over time.
So a question like “do you need to keep history about changes of the relation between the customer and the agent” becomes very important if he change his mind.
Because now you need to change the data integration and make it slowly changing dimension type II instead of type I.
So it will probably take a lot more hours of development, quality analysis, dashboards and reports.
The customer doesn’t understand the consequences of his answers so you need to be very precise when asking questions.
It’s okay that the customer changes his mind during the project, but you need to be sure that he is going to pay for the changes, also he must understand that the amount of hours you “predicted” at first is changing now, so it will not come by surprise.
Sub tip: It would always get complicated.
Even if the customer says it’s simple, even if you think it’s simple, even if you know the data source and you did it three times before.
You know the database, the target, the idea, and the solution – it will get complicated.
The customer suddenly will say “what about the other servers?”
And the question is “what other servers”?
In just became complex.
And the answer… “Those that we have in Malta.”
“We forgot to tell you it’s coming from Cassandra…”
“Yes, it is not MySQL its Json”
“We forgot to tell you, it’s not our web service, it’s a third-party CRM solution that was invented by three kids in India”.
From experience – no easy projects, no rest for the brave.
If you want to get the all e-book free access, please subscribe to our newsletter
(scroll up its on the right side – enter your email)
or you can goto the next post to read
click here for pentaho kettle data integration tip #3-#6
Trackback from your site.