Thursday, December 31, 2015

year in review

2015, what a year. The year I finally managed to get out of my comfort zone (or one of them) and left my country to work and live abroad. I also proposed to my wife and got engaged.

It was also the weirdest and most unfortunate year in terms of job and career. That part did not go very well: many funny people with funny stories during a very funny job (I am still trying to get over it and not explode with anger, but I am working on it). At least I got to do a lot of reading and self-study!

Overall it was not bad. Yes, some things did not work out, but I am feeling optimistic. It felt like a year of life-changing events, and now, on this very last day, I am dreaming of (or hoping for) bigger and more exciting things to come. I feel that I am finally changing as a person: I embrace change a lot more easily, and I am happier and less stressed than I used to be. This is a big change for a person who spent a lot of his time constraining himself in several areas, beliefs or ideas.

Goodbye 2015, you will be remembered for sure :). 2016, here I come!

PS
Dear Santa, I would kindly like to ask you for a decent job next year. A job into which I can funnel all my passion and love. A job where I will be able to share my knowledge and learn at the same time. I know it is not an easy gift to find but, as I already said, I am feeling optimistic.

Paris

Happy New Year to everyone, with health and many dreams.



Thursday, December 17, 2015

My evolved news crawler :) v1.8

Well, I needed to kill some time during this strange intermission period between jobs. My original 1 hour hack (less than 100 lines of code) evolved into something more flexible and useful (I hope so). My father is very happy now: instead of 1 newspaper summary he receives 10.

He was also kind enough to email me (from the Pacific) about some early bugs, like duplicate entries and formatting issues, which I tried to resolve. It is always fun to have someone use your code, isn't it?

Of course, in order to honor my Java development heritage, I had to create my own mini framework / crawling logic in this small tool. All Java devs do it!! It's not that complex actually, and now I can easily add more crawlers for similar sites.
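To give an idea of what such a mini framework could look like, here is a minimal sketch. All names (SiteCrawler, CrawlerRegistry, crawlAll) are hypothetical and made up for illustration; the actual code in the project may be organised quite differently. The idea is simply one small interface per site plus a registry that runs them all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract that every site-specific crawler would implement.
interface SiteCrawler {
    String siteName();

    // Returns at most maxArticles article summaries for this site.
    List<String> crawl(int maxArticles);
}

// Hypothetical registry: adding support for a new site means registering one more crawler.
class CrawlerRegistry {
    // LinkedHashMap keeps registration order, so reports come out in a stable order.
    private final Map<String, SiteCrawler> crawlers = new LinkedHashMap<>();

    void register(SiteCrawler crawler) {
        crawlers.put(crawler.siteName(), crawler);
    }

    // Runs every registered crawler and collects its summaries, keyed by site name.
    Map<String, List<String>> crawlAll(int maxArticles) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (SiteCrawler crawler : crawlers.values()) {
            result.put(crawler.siteName(), new ArrayList<>(crawler.crawl(maxArticles)));
        }
        return result;
    }
}
```

With something like this in place, supporting a similar site would only mean writing one more SiteCrawler implementation and registering it.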

Currently I support the following sites (Greek for the time being), but I will keep adding more:

I have also added 2 optional command line arguments:
  • A flag to control the maximum number of articles to be crawled and included in the final report.
  • A flag to control the creation of zip files that contain each HTML report. That way I manage to reduce the size even more, so when I email them the payload is far smaller :).
You can find more on the official GitHub page. By the way, I try to keep my documentation up to date.
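As a rough illustration of what these two options do, here is a small sketch that uses only the JDK. The class and method names (ReportOptions, limit, zipReport) are hypothetical, the real flag names are not shown here, and treating a non-positive limit as "no limit" is my assumption for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
class ReportOptions {
    // Keeps at most maxArticles items; a non-positive value means "no limit" (an assumption).
    static <T> List<T> limit(List<T> articles, int maxArticles) {
        if (maxArticles <= 0 || maxArticles >= articles.size()) {
            return articles;
        }
        return articles.subList(0, maxArticles);
    }

    // Packs one HTML report into an in-memory zip archive with a single entry.
    static byte[] zipReport(String entryName, String html) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(html.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }
}
```

HTML is mostly repetitive text, so it compresses well, which is what makes the zip option worthwhile over an expensive satellite link.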

There you will find all the material required to run or compile this small utility, plus any requirements.

I will soon add a small section for those (if anyone is interested) who would like to plug in extra crawling implementations for other RSS-based sites.

Of course there is a lot of stuff that I could do in order to improve the utility, and most probably I will continue to add crawlers for more sites and make the design more 'modular'.

Happy crawling.

Wednesday, December 09, 2015

Playing with JSoup and crawling a Greek newspaper ...in order to deliver news in the middle of the ocean :)

Recently I stumbled upon several articles and examples of this handy library called JSoup, and I wanted to give it a try. It was a good opportunity to play and experiment with CSS selectors.

My main need was a family request. My father is a captain in the merchant navy. He still travels the oceans on big tankers and cargo ships. (Like those below, actually this is one of them.)



Nowadays all of these vessels have satellite comms, but opening the link and transferring any data costs a lot. Most of the crew usually get some kind of prepaid card from the satellite internet provider, so they can use Skype or any other service for a short time... very short. To cut a long story short, if anyone wants to read the news on a regular site, just opening the site, with all its images and content, would cost him a lot in credits. My father wants to keep in touch with the news back in Greece, so for some time I was manually copy-pasting news into a simple HTML or Word file and sending him this summary by email. Of course, doing this manually every morning was kind of boring and error-prone. I needed to create something that would do the same thing for me.

I spent less than 2 hours last night and, with some basic calls and functionality provided by JSoup, I hacked together my custom crawler (it's not rocket science). Note that it is a very specific tool: it crawls a specific newspaper (tovima.gr, that is) and it outputs the summary of the article content to a plain HTML file.
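The output side of such a tool can be sketched with nothing but the JDK. The snippet below is only an illustration under my own assumptions: the names (PlainHtmlReport, render, escape) are hypothetical, and the JSoup part that fetches the page and selects the article text is deliberately left out. It just turns a list of already extracted summaries into a bare HTML page, with no images, scripts or CSS.

```java
import java.util.List;

// Hypothetical helper, for illustration only.
class PlainHtmlReport {
    // Escapes the characters that would otherwise be read as markup.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Builds a bare HTML page: one heading plus one paragraph per summary.
    static String render(String title, List<String> summaries) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><meta charset=\"utf-8\"><title>")
            .append(escape(title))
            .append("</title></head><body>\n");
        html.append("<h1>").append(escape(title)).append("</h1>\n");
        for (String summary : summaries) {
            html.append("<p>").append(escape(summary)).append("</p>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }
}
```

The resulting string can be written to disk with java.nio.file.Files.write, and a page like this stays tiny compared to loading the full site.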



You can find the project and code here. Feel free to use it if by any chance you have the same need, extend it, and maybe add a similar crawler for another newspaper? There is a README section and a helper bash script.

You need Java 8 to use it and Maven 3 to build it. I am using the maven-shade plugin to create a small uber-jar.