To get the rest of the tips,
please visit our homepage for my opening video about the course.
Top 10 tips for data integration with Pentaho Kettle: tips #4-#6
Tip #4: there is no right or wrong
Pentaho Kettle is a wonderful tool with a lot of functionality and options.
In fact, you can do almost anything in more than one way.
So you probably have a dilemma about how to solve a specific task, and you go to the Internet to try to find the one right, perfect way to solve it.
You need to look at Pentaho as a toolbox: use your imagination to solve the problem.
It is just like asking what the right material for building a house is:
- Wood
- Aluminum
- Steel
- Concrete
- Mongolian tent (yurt)
- Igloo
- Green construction
- Others?
You can use any one of them; you need to decide.
It is just a matter of complexity (how much work you must put in to make it functional), optimization, and maintenance.
In my course I demonstrate how you can use several different steps to do the same thing.
For example, I compared Table output, Insert/Update, plain Update, and bulk loading.
They all load data into the database; the question is when to use which.
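As rough intuition only, here is the kind of SQL each of those steps boils down to (an illustrative, MySQL-flavored sketch; the table and values are made up):

```sql
-- Table output: plain inserts, the simple path for brand-new rows
INSERT INTO customers (id, name) VALUES (7, 'Dana');

-- Update: only changes rows that already exist, matched by key
UPDATE customers SET name = 'Dana' WHERE id = 7;

-- Insert/Update: update the row if the key exists, insert it otherwise
INSERT INTO customers (id, name) VALUES (7, 'Dana')
ON DUPLICATE KEY UPDATE name = 'Dana';

-- Bulk load: stream a whole file into the table, best for very large batches
LOAD DATA INFILE '/tmp/customers.csv' INTO TABLE customers;
```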
You can buy my course; there are a lot more tips and case studies in it.
By the way, I myself solve most of my complex jobs while walking. To do that, you need to know most of Pentaho Kettle's steps by heart.
I'm sure you'll get there; you just need persistence and customers with high demands (and money, of course).
Tip #5: use variables.
Don't you hate it when the customer calls you two years after the project is finished and blames you because the solution you developed doesn't work anymore?
We all know he changed something; transformations and jobs don't break by themselves.
It is usually one of two things:
a. A manager tried to save a few bucks and did it on his own, damaging the project in the process.
b. The customer changed the environment: servers, network, computer names, folders, and more.
Unfortunately, there is no cure for dumb customers, but for a changed environment there is: use variables.
I know it's easier and faster to just hardcode everything inside the transformations and jobs, but it is not sustainable.
As soon as the customer changes something, instead of updating a parameter in kettle.properties or in a table, you will have to hunt down every place you hardcoded an IP, a name, a folder, a file name and so on.
Think even of the simple scenario where you want a QA environment; the parameters differ between production and QA.
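A minimal sketch of what such entries might look like in kettle.properties (the variable names here are my own examples, not required names):

```properties
# kettle.properties lives in the user's .kettle directory
# Example entries only; pick names that fit your project
DB_HOST=prod-db.example.com
DB_PORT=3306
SRC_DIR=/data/incoming
ALERT_EMAIL=etl-alerts@example.com
```

In your connections and file steps you then refer to ${DB_HOST}, ${SRC_DIR} and so on, and the QA environment simply gets its own kettle.properties with different values.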
You can skip a lot of things along the way, like logging, e-mail, and error handling, but don't skip variables. Skipping the others will just make you work a little harder; a lack of variables will make you suffer.
Tip #6: add notes to every transformation
Ever tried to read code that somebody else wrote, when your job was to alter it?
That's a pain, right?
So don't do to others what you don't want others to do to you.
Hell, don't do it to yourself.
You never know when you will next need to examine a transformation and change something in it. Always add notes.
Imagine reading 3,000 lines of C# code without comments; the developer thought he was writing it for himself and that no one else would ever read it.
Even he will find out, three years later, that he doesn't remember anything about that code. So think of a transformation as code: little widgets that are parts of a machine that does great things.
Add notes.
If you want free access to the whole e-book, please subscribe to our newsletter
(scroll up, it's on the right side; enter your email),
or you can go on to the next post:
click here for: Top 10 tips for data integration with Pentaho Kettle: what's next
Top 10 tips for data integration with Pentaho Kettle: tips #1-#3
Tip #1: keep it simple – divide and conquer
Some developers tend to complicate things.
Pentaho Kettle will not save you from yourself, and neither will any other ETL tool;
you need to figure out how to keep transformations simple and readable, for your own sake.
If you don't control your creativity while "playing with steps", you can find yourself with spaghetti.
Now, it might work, but in case it doesn't, "Houston, you have a problem".
A good example of a bad example: picture one huge transformation where dozens of steps cross each other like spaghetti.
Now think about what really happens if you have an error somewhere.
I don't know whether you have real hands-on experience with Pentaho Kettle, but you will soon find out
that sometimes the logs are not enough to understand the problem and how to solve it.
My suggestion is: divide and conquer.
Kettle has many steps that can move data from one transformation to another;
also, sometimes you can just load the data into a table and then extract it again in the next transformation.
Yes, I understand, it sounds like overkill and far from optimal, but it depends on the specific scenario.
So if you have one transformation that loads 20,000 rows,
it's not a problem to load them and then, in the next transformation, extract them from the table again.
You can also use the "Copy rows to result" step and then read the rows in your next transformation with "Get rows from result".
Another option is to use a temporary file; Kettle writes to and reads from files very fast.
Again, it's a matter of resources and database size;
you'll need to act differently if you have 10 billion rows.
When you divide a large transformation into smaller ones, you can see
where the problem/error occurs and pinpoint it, and thus find the solution a lot quicker and more easily.
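As a minimal sketch of the staging-table option (all table and column names here are invented for illustration):

```sql
-- Transformation 1 ends with a Table output step writing into a staging table:
CREATE TABLE stg_orders (
  order_id   INT,
  order_date DATE,
  amount     DECIMAL(10, 2)
);

-- Transformation 2 then starts with a Table input step reading it back:
SELECT order_id, order_date, amount
FROM stg_orders;
```

If transformation 2 fails, the staged rows are still sitting in stg_orders, so you know exactly where the pipeline broke and can rerun just that part.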
On the other hand, making a transformation too short is also a problem.
Let's say you have a MySQL database; at your first step you use
"Table input", and at the next step you use "Table output".
Your SQL statement looks something like this (a made-up illustration; the tables and fields are invented):
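```sql
-- Everything crammed into one statement: renames, type casts, and a union
SELECT cust_id               AS customer_id,
       CAST(join_dt AS DATE) AS join_date,
       'web'                 AS source_system
FROM   web_customers
UNION
SELECT id                    AS customer_id,
       CAST(created AS DATE) AS join_date,
       'store'               AS source_system
FROM   store_customers;
```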
In this situation you have only two steps, but you didn't actually use Pentaho
Kettle the right way, because you did everything in the SQL statement.
So when you have an error and need to find it, you will have to dissect the SQL.
Examples of the work that gets buried in the SQL this way:
a. changing names of fields
b. converting data types
c. unions
My suggestion:
use Table input only to bring in the raw data. The only exceptions are:
a. joins
b. order by
Those two are better done at the source,
meaning: do not bring all the rows into Pentaho Kettle and then "join" or "order by" if you don't have to.
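For example, a Table input query in that spirit (again, invented names) pushes only the join and the sort down to MySQL and leaves everything else to Kettle steps:

```sql
-- Only the join and the sort run at the source;
-- renames, casts, and unions are done in Kettle steps afterwards.
SELECT c.id, c.name, o.order_dt, o.amount
FROM   customers c
JOIN   orders    o ON o.customer_id = c.id
ORDER BY o.order_dt;
```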
Tip #2: do not over-optimize
We IT guys and girls, engineers, feel that if it's not perfect, we have failed.
So we spend a lot of time and money (usually the customer's money) to make it perfect and flawless.
For example (from a real customer, of course):
two engineers spent four days optimizing a job to improve its execution time from 5 minutes to 3 minutes.
They were so happy when they succeeded. But what was the point?
It doesn't have to be perfect; it needs to run well and stably within the time frame the customer defines.
It would be a different story if the customer needed the data updated every 4 minutes and the job ran for 5.
In our case, the customer wanted it updated only once a night, and it ran for 30 minutes.
So if you like to fix things that aren't broken, do it on your own time and money.
We want the transformations and jobs to be optimal, but here is a little secret:
you can always make your jobs and transformations faster.
It's a matter of the time you need to spend and how important the optimization is.
Don't worry, you will find that the customer invents crazy transformations and jobs all on his own.
You don't need to invent complexity; it's there all the time anyway.
Tip #3: detailed design, baby.
Always analyze the processes that need to be computerized.
The customer doesn't know what you really need; when you ask the right questions, you'll find out the truth about the project.
You don't need to overdo it, of course, just enough to get the full picture, and when you do, write it down and have the customer sign it.
This is the basis of your work, your project, your agreement with the customer.
You will find that the customer "forgets" the initial analysis and that the project changes over time.
So a question like "do you need to keep history of changes in the relation between the customer and the agent?" becomes very important if he changes his mind,
because now you need to change the data integration and build a slowly changing dimension of type II instead of type I.
That will probably take a lot more hours of development, quality assurance, dashboards, and reports.
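To make the extra cost concrete, here is a rough SQL sketch of the difference (the dimension table and its columns are invented for illustration):

```sql
-- Type I: overwrite the attribute in place; history is lost
UPDATE dim_customer
SET    agent_id = 42
WHERE  customer_id = 7;

-- Type II: close the current row, then insert a new versioned row
UPDATE dim_customer
SET    valid_to = CURRENT_DATE, is_current = 0
WHERE  customer_id = 7 AND is_current = 1;

INSERT INTO dim_customer (customer_id, agent_id, valid_from, valid_to, is_current)
VALUES (7, 42, CURRENT_DATE, NULL, 1);
```

Every load, report, and dashboard that touches the dimension now has to filter on is_current or on the validity dates, which is where the extra hours go.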
The customer doesn't understand the consequences of his answers, so you need to be very precise when asking questions.
It's okay for the customer to change his mind during the project, but you need to be sure he is going to pay for the changes; he must also understand that the number of hours you "predicted" at first is changing now, so it will not come as a surprise.
Sub-tip: it will always get complicated.
Even if the customer says it's simple, even if you think it's simple, even if you know the data source and you have done it three times before.
You know the database, the target, the idea, and the solution: it will still get complicated.
The customer will suddenly say, "What about the other servers?"
And the question is: "What other servers?"
It just became complex.
And the answer... "Those that we have in Malta."
"We forgot to tell you it's coming from Cassandra..."
"Yes, it's not MySQL, it's JSON."
"We forgot to tell you, it's not our web service, it's a third-party CRM solution that was invented by three kids in India."
From experience: there are no easy projects, and no rest for the brave.
If you want free access to the whole e-book, please subscribe to our newsletter
(scroll up, it's on the right side; enter your email),
or you can go on to the next post to read.