Top 10 tips for data integration with pentaho kettle : tips #4-#6

Tip #4: there is no right or wrong

Pentaho kettle is a wonderful tool with a lot of functionality and options.
In fact you can do almost anything in more than one way.
So you are probably have a dilemma on how to solve a specific task, and you go to the Internet and try to find the right perfect way to solve it.
You need to look at pentaho as a toolbox. You can use your imagination to solve it.
It is just like asking what is the right material to build a house?pentaho data integration

  • Wood
  • Aluminum
  • Steel
  • Concrete
  • Mongolian tent(YURT)
  • Igloo
  • Green construction
  • Others?

you can use each one of them, you need to decide.
It is just a matter of complexity (how much you need to work in order to make it functional) optimization and maintenance.
In my course I demonstrate how you can use several steps to make the same thing.
For example I did a comparison between table output, update/insert, just update, bulk load.

They all load the data into the database, the question of when do I use which?
You can buy my course, there are a lot more tips and case studies.
By the way I, myself, solve most of my complex jobs while walking. In order to do that you need to know most of pentaho kettle steps by heart.
I’m sure you’ll get there, you just need persistence and customers with high demands (and money of course)

Tip #5: use variables.

Don’t you hate it that the customer calls you two years after the project finished and blame you that the solution you developed doesn’t work anymore?
We all know he changed something. Transformation and jobs doesn’t get broken by themselves.

It can be one of two options:pentaho data integration wheel
a. A manager tried to save some bucks and did it on his own, while damaging the project in the process.
b. The customer changed the environment, meaning: servers, network, computer names, folders and more.
Unfortunately, there is no cure for dumb customers but for changing the environment there is. Use variables.

I know it’s easier and faster to just hard copy everything inside the transformation and jobs but it is not sustainable.
As soon as the customer will change something instead of changing Parameter on the or in the tables, you will need to find all the places you hardcoded IP, names, folder, file names and such.
Think even on the simple scenario that you want a QA environment, the parameters are different between the production and the QA.

You can skip a lot of steps on the way like logging, e-mail, error handling but don’t skip variables. The others will just make you work a little harder. Lack of variables would make you suffer.

Tip #6: Note on every transformation

Ever tried to read code that somebody else wrote and your job was to alter it?
Well that’s a pain right?
So don’t do to others what you don’t want others to do to you.
Hell, don’t do it for yourself.

You never know when would be the next time you need to examine the transformation and change something in it. Always note.
Imagine yourself reading C sharp code of 3000 rows without remarks, the developer thoughts he is developing it for himself and no one else, ever, will read it.

Even he would find out that three years later, he doesn’t remember anything about that code. So think about transformation as code, little widgets that are part of a machine that do great things.

If you want to get the all e-book free access, please subscribe to our newsletter
(scroll up its on the right side – enter your email)

or you can goto the next post to read

click here for : Top 10 tips for data integration with pentaho kettle whats next

Top 10 tips for data integration with pentaho kettle : tips #1-#3

Tip #1: keep it simple – divide and conquer

There are developers that tend to complicate stuff.
Pentaho kettle will not save you from yourself, nor any other ETL tool
you need to figure out how to keep the transformation simple and readable for your own sake.

If you’re not going to control your creativity while “playing with steps” you can find yourself with a spaghetti.
Now, it could work, but in case it doesn’t “you have a problem-Houston”
Good example for a bad example:

pentaho kettle spageti schemaNow think what really happens if you have an error somewhere,
I don’t know if you have real experience hands on with pentaho kettle but you soon find out
that sometimes the logging are not enough to understand the problem and how to solve it.
My suggestion is: divide and conquer
you have many steps in kettle that can move data from one transformation to another,
also, sometimes you can just load it to a table and then extract again on the next transformation.
Yeah, I understand, it sounds like over doing and unoptimized. But is it a matter of the specifics scenario.

So if you have one transformation that loads 20,000 rows,
it’s not a problem to load them and then on the next transformation to extractfrom the table again.
You can also use the “copy rows results” step and then to use it on your next transformation with “get rows from results”.
Another option is to use a temporary file. Kettle writes and reads very fast from files.
Again it’s a matter of resources and database size.
You’ll need to act differently if you have 10 billion rows.
When dividing the large transformation into smaller transformations you can control
where the problem/error occur and to pinpoint the problem thus find the solution a lot quicker and easy.

On the other hand making the transformation too short is also a problem,
let’s say you have a MySQL database and at your first step use
“table input” and on the next step the use “table output”

Your SQL statement looks like that:kettle pentaho sql code

In this situation you have only two steps, but you actually didn’t use pentaho
kettle the right way, that is because you did everything in the SQL statement.
So if you have an error and you need to find it. You will need to dissect the SQL.

a. change names of fields
b. convert data types
c. union
My suggestion:
use the table input in order to bring raw data, the only exceptions are:
a. joins
b. order by
Those two are better to do in the source,
meaning: do not bring all the rows to pentaho kettle and then “join” or “order by” if you don’t have to.

Tip #2: do not over optimize:

We, IT guys and girls engineers, feel that if it’s not perfect we failed.
So we spend a lot of the time and money (usually customers money) in order to make it perfect and flawlesspentaho spoon.

So for example: (from a real customer of course)
Two engineers spent 4 days to optimize a job in order to improve its execution time from 5 minutes to 3 minutes.
they were so happy when they succeeded. But what was the point?

It doesn’t have to be perfect, it needs to work well and stable in the time frame the customer defines.
So it is a different story if the customer needs the data to be updated every 4 min. and it runs for 5 minutes.
In our case, the customer wanted it to be updated only once a night and it ran for 30 minutes.
So if you like to fix stuff that are not broken do it on your own time and money.

We want the transformation and jobs to be optimal,

a little secret: 
you can always make your jobs and transformations faster.
It’s a matter of the time you need to spend and the importance of the optimization.
Don’t worry, you will find out that the customer invents crazy transformations and jobs all on his own.
You don’t need to invent complexity, it’s there all the time anyway.

Tip #3: detailed design, baby.

Always analyze the processes that need to be computerized.

The customer doesn’t know what you really need, when you ask the right questions you’ll find out the truth about the project.

You don’t need to overdo it of course, but just enough to get the full picture and when you do – write it down and let the customer sign it.
This is the base of your work, your project, your agreement with the customer.

You will find out that the customer “forget” what was the initial analysis and the project changes over time.

So a question like “do you need to keep history about changes of the relation between the customer and the agent” becomes very important if he change his mind.pentaho spoon
Because now you need to change the data integration and make it slowly changing dimension type II instead of type I.
So it will probably take a lot more hours of development, quality analysis, dashboards and reports.
The customer doesn’t understand the consequences of his answers so you need to be very precise when asking questions.
It’s okay that the customer changes his mind during the project, but you need to be sure that he is going to pay for the changes, also he must understand that the amount of hours you “predicted” at first is changing now, so it will not come by surprise.

Sub tip: It would always get complicated.

Even if the customer says it’s simple, even if you think it’s simple, even if you know the data source and you did it three times before.
You know the database, the target, the idea, and the solution – it will get complicated.

The customer suddenly will say “what about the other servers?”
And the question is “what other servers”?

In just became complex.
And the answer… “Those that we have in Malta.”

“We forgot to tell you it’s coming from Cassandra…”
“Yes, it is not MySQL its Json”

“We forgot to tell you, it’s not our web service, it’s a third-party CRM solution that was invented by three kids in India”.

From experience – no easy projects, no rest for the brave.

If you want to get the all e-book free access, please subscribe to our newsletter
(scroll up its on the right side – enter your email)

or you can goto the next post to read

click here for pentaho kettle data integration tip #3-#6

For those who consider to become a data integration (spoon) pentaho kettle developer

you need to be processed oriented, it is not that programming or PHP it’s more like understanding the requirements from the customer, understand the flow of how the customer runs his business and translate it to a system that allows him to get real information so he/she can decide about actions the company needs to take.

So you need a combination of: understand managers and needs but at the same time to be a technological skilled in order to implement it the right way.
I think the most interesting thing about it is that you become more acquainted of the business than the managers, and because you see the full picture you know what questions need to be asked before starting the project.

What I like in this process the most is to analyze the requirements and then implement it.
From my point of view pentaho kettle is a playground, like a puzzle / logic games that need to be solved. And they pay me for that.

So if you feel that you’re like that, you enjoy interviewing managers and then try to find the right solution to bring data from ridiculous sources and try combining them to one place in logic that will give the manager a way to analyze and decide according.
You came to the right place.

pre requirements (ingredients) for the pentaho kettle course

I built the course for all users of pentaho kettle developers from newbies to experts.
Of course the newbies will gain more knowledge and experience due to the amount of new features and ideas they will meet.
For the more experienced ones , I added

  • case studies
  • specific steps that we don’t use regularly and use of steps In scenarios
  • steps you don’t understand why they created them until you find why
  • tips and tricks

General announcement:

tip 2#: there is no course in the world that will make you understand how to work with this software without serious hands on. You must practice until you get it.
It’s like having a subscription for the gym and not going. (I wonder where are the muscles they promised)

but you should know at least some :

1. database design
2. sql queries

If you dont know SQL then you can search on udemy :
i think this course can be good for you : sql-database-course-for-kettle-tutorial

if you want to be an expert database designer , i suggest you read the “bible” of datawarehouse toolkit
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling

database design and pentaho kettle

link to wikipedia : database design

find more information about pentaho kettle

here are some resources for learning more about pentaho kettle

1 pentaho kettle forum (click here)

2 books
I have them all , because learning is my hobby:
I can remember myself 10 years ago traveling to London with my girlfriend (now she’s my wife) spending six hours at the bookstore for information technologies that nowhere to be find but there, returning home with 12 kg of books on my back.
Then I evolved: my back went backward and the technology went forward.
So I bought books on Amazon. Then they invented the Kindle and now we have gone to online courses.
So udemy became my best friend.

book #1: “pentaho kettle solution” (click here)
created by: Matt Custers – who also developed kettle before it was bought by pentaho, Roland Bouman, Jos von Dongen.
From my point of view this is the “Bible” of pentaho kettle.
It deals with more complex, exotic subject so if I must say I think my course represents 80% of the material that gives you 95% of what you need day by day.
The book me go into details in subjects like:

• Input / output xml
• Unstructured data handling
• Nosql databases (mongo…)
• Web services
• Clustering (scale out on more servers for performance)
I would like to compliment Matt casters for developing one incredible piece of software that helps all of us making better solutions for managers decisions. Great job.

Book #2 : pentaho solutions (click here)
written by: Roland Bouman, Jos von Dongen
very good book also, but these more general about kettle and deals with the whole pentaho suite.

Book #3 : “pentaho 3.2 data integration” (click here)
by Maria carina Roldan it’s a beginning guide to pentaho kettle

3 Examples that comes with the installation of pentaho kettle
under folder data integration you have examples of many steps that you can run with spoon (the bat file that runs kettle) and learn.

4 blogs:

1. My blog here in the website
2. Matt casters blog
3. Roland Bouman blog

I think the combination of my course and the first book : “pentaho kettle solutions” will give you the whole package.