Monday 30 December 2013

org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException: A job instance already exists and is complete for parameters


Spring Batch requires a unique set of job parameters for every execution of a job, so you can add the current time as a job parameter to make each run unique:

// JobParameter and JobParameters live in org.springframework.batch.core
Map<String, JobParameter> confMap = new HashMap<String, JobParameter>();
confMap.put("time", new JobParameter(System.currentTimeMillis())); // differs on every run
JobParameters jobParameters = new JobParameters(confMap);
jobLauncher.run(springCoreJob, jobParameters);
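
Alternatively, Spring Batch ships a RunIdIncrementer (org.springframework.batch.core.launch.support.RunIdIncrementer) that can be attached to the job definition, so each launch automatically gets a fresh run.id parameter.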

Friday 20 December 2013

Sort mapreduce output keys in descending order

Add the following comparator as a nested class in your job class:

public static class ReverseComparator extends WritableComparator {
    
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    public ReverseComparator() {
        super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
       return (-1)* TEXT_COMPARATOR.compare(b1, s1, l1, b2, s2, l2);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        if (a instanceof Text && b instanceof Text) {
                return (-1)*(((Text) a).compareTo((Text) b));
        }
        return super.compare(a, b);
    }
}
In the new API (org.apache.hadoop.mapreduce), register it on your Job (see the driver sketch below):
job.setSortComparatorClass(ReverseComparator.class);

NB. This only works if your key is of type Text; otherwise modify the reverse comparator class accordingly.
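
For context, here is a minimal driver sketch showing where the comparator is registered; MyMapper, MyReducer and the output value type are hypothetical placeholders for your own job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDescendingDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sort-desc");
        job.setJarByClass(SortDescendingDriver.class);
        job.setMapperClass(MyMapper.class);       // hypothetical mapper
        job.setReducerClass(MyReducer.class);     // hypothetical reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setSortComparatorClass(ReverseComparator.class); // keys now sort descending
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}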

Set separator for mapreduce output

By default the key/value output separator is a tab character. To have the output separated by a character of your choice, set this configuration property:
conf.set("mapred.textoutputformat.separator", ",");
where conf is an org.apache.hadoop.conf.Configuration object. The mapreduce output (i.e. the key and value) will be comma separated in this case.
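
In newer Hadoop releases (the org.apache.hadoop.mapreduce API) the equivalent key is mapreduce.output.textoutputformat.separator; the old mapred.* name should still resolve through Hadoop's deprecated-property mapping. A minimal sketch of setting it before submitting the job:

Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ","); // newer property name
Job job = new Job(conf, "csv-output");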


Tuesday 10 December 2013

Region servers going down in cdh4 due to mapreduce job

I faced this problem because I had set the scan caching to 500, i.e. each RPC passed 500 rows to the mapreduce job, which is memory intensive and not recommended; see the sketch below.
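
A minimal sketch of dialling the caching back when wiring the scan into a mapreduce job; it assumes an existing Job instance, and the table name and MyTableMapper are hypothetical:

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

Scan scan = new Scan();
scan.setCaching(100);        // rows fetched per RPC; far gentler than 500
scan.setCacheBlocks(false);  // recommended for MR jobs so the block cache isn't churned
TableMapReduceUtil.initTableMapperJob("myTable", scan,
        MyTableMapper.class,                       // hypothetical TableMapper
        ImmutableBytesWritable.class, Result.class, job);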

Data driven db input format

Include the id column also.....

In the case of DBInputFormat, don't use the id in the VO.

Thursday 21 November 2013

Password of cloudera-scm user

It is present in the file

/var/lib/cloudera-scm-server-db/data/generated_password.txt

Wednesday 20 November 2013

Enable the logging feature for a SOAP web service to see the test case response

In the WEB-INF folder, inside service.xml:

<jaxws:endpoint...>  
   <jaxws:features>  
      <bean class="org.apache.cxf.feature.LoggingFeature"/>
   </jaxws:features>  
</jaxws:endpoint>

Thursday 14 November 2013

Find count of each word in a file in linux command line

tr -s '[:space:]' '\n' < your_filename | sort | uniq -c | sort -rn | head -n 50

This shows the 50 most frequent words in your file, sorted by count.

Tuesday 5 November 2013

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/Scan

Setting the HADOOP_CLASSPATH variable will fix this issue:
 
export HADOOP_CLASSPATH=`/usr/bin/hbase classpath`
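
After exporting it, launch your job as usual; for example (the jar and driver class names here are hypothetical):

hadoop jar myjob.jar com.example.MyDriver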

Thursday 31 October 2013

To install R studio in centos 6

CentOS does not support RStudio Desktop, so you have to install RStudio Server instead, which runs in a browser.


For EL6


 
RStudio Server has several dependencies on packages (including R itself) found in the Extra Packages for Enterprise Linux (EPEL) repository. If you don't already have this repository available you should add it to your system using the instructions found on the Fedora EPEL website.

$ su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm'


After enabling EPEL you should then ensure that you have installed the version of R available from EPEL. You can do this using the following command:
 
$ sudo yum install R
 
 

Download and Install

To download and install RStudio Server open a terminal window and execute the commands corresponding to the 32 or 64-bit version as appropriate.

32-bit Size: 17.5 MB MD5: 3bc83db8c23c212c391342e731f65823

$ wget http://download2.rstudio.org/rstudio-server-0.97.551-i686.rpm
$ sudo yum install --nogpgcheck rstudio-server-0.97.551-i686.rpm

64-bit Size: 17.6 MB MD5: c89d5574a587472d06f72b304356a776

$ wget http://download2.rstudio.org/rstudio-server-0.97.551-x86_64.rpm
$ sudo yum install --nogpgcheck rstudio-server-0.97.551-x86_64.rpm


Then in your browser go to address

http://<your_server_name>:8787

The login credentials are the username and password of your CentOS account.
 


Reference: http://www.rstudio.com/ide/download/server

Wednesday 30 October 2013

HBase

list - lists the tables

If needed, add

<property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2182</value>
</property>

to hbase-site.xml, change clientPort to 2182 in zoo.cfg, and in hbase-env.sh uncomment

export HBASE_MANAGES_ZK=true

try restarting the processes



Run this from the terminal to count the number of distinct rows in an HBase table:

echo "scan 'table'" | hbase shell | grep "columnFamliyName:" | wc -l

This counts one line per cell, so divide the count by the number of columns per row and you get the number of distinct rows.
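
Alternatively, the HBase shell has a built-in row counter, and HBase also ships a mapreduce-based one that scales better for large tables:

echo "count 'table'" | hbase shell
hbase org.apache.hadoop.hbase.mapreduce.RowCounter table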

Grep and replace linux

Finding and replacing using grep in linux / centos 6+ 

1. The syntax is 
 grep -rl 'findWord' path_to_your_file_name | xargs sed -i 's/findWord/replaceWord/g'

2. Searching for all files in the current directory for the term windows and replace it with linux
grep -rl 'windows' ./ | xargs sed -i 's/windows/linux/g'
s/ - starts the sed substitute command
/g - the global flag; replaces every occurrence on each line, not just the first
./ - the current directory

3. You can use regex also - eg.
grep -rl '{:\".*\",\"hello\":\"' data.txt | xargs sed -i 's/{:\".*\",\"hello\":\"//g'

Thursday 17 October 2013

Point hbase conf folder to another folder using alternatives(a linux command) in CentOS

$ alternatives --display hbase-conf
hbase-conf - status is auto.
 link currently points to /etc/hbase/conf.dist
/etc/hbase/conf.dist - priority 30
Current `best' version is /etc/hbase/conf.dist.

$ alternatives --install /etc/hbase/conf hbase-conf /etc/hbase/conf.my_cluster/ 10
failed to create /var/lib/alternatives/hbase-conf.new: Permission denied

$ sudo alternatives --install /etc/hbase/conf hbase-conf /etc/hbase/conf.my_cluster/ 10

$ sudo alternatives --display hbase-conf
hbase-conf - status is auto.
 link currently points to /etc/hbase/conf.dist
/etc/hbase/conf.dist - priority 30
/etc/hbase/conf.my_cluster/ - priority 10
Current `best' version is /etc/hbase/conf.dist.

$ sudo alternatives --set hbase-conf /etc/hbase/conf.my_cluster/

$ sudo alternatives --display hbase-conf
hbase-conf - status is manual.
 link currently points to /etc/hbase/conf.my_cluster/
/etc/hbase/conf.dist - priority 30
/etc/hbase/conf.my_cluster/ - priority 10
Current `best' version is /etc/hbase/conf.dist.

Owlim lite installation Centos

Follow the instructions on this link 

http://owlim.ontotext.com/display/OWLIMv52/OWLIM-Lite+Installation

If "access denied" is shown at times, change the owner of that folder to tomcat:

sudo chown tomcat:tomcat your_folder

If you have more than 64,000 entities to be stored in OWLIM, then to support them edit /etc/tomcat6/tomcat6.conf and set:

CATALINA_OPTS="-DentityExpansionLimit=1000000"   # or your desired number

Restart tomcat and everything should be fine:
 sudo service tomcat6 restart

Thursday 10 October 2013

RDFParseException in Eclipse when using large RDF files

org.openrdf.rio.RDFParseException: The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.


This error occurs when running SPARQL queries because the parser, by default, does not allow more than 64,000 entity expansions.

Open Window -> Preferences -> Java -> Installed JREs, edit the installed JRE, and add the following to the default VM arguments (the arguments must be separated by a space):

-Xmx1024m -DentityExpansionLimit=100000
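
The same system property applies to any JVM launch, not just Eclipse; for example (the jar name is hypothetical):

java -Xmx1024m -DentityExpansionLimit=100000 -jar yourapp.jar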


Wednesday 9 October 2013

Jena Fuseki server configurations

Download and unzip the Jena Fuseki server.

To start the server with the dataset held in memory:

fuseki-server --update --mem /ds

To start the server with the dataset persisted in a user-specified directory:

fuseki-server --update --loc=your_path_to_directory /ds


If you don't specify --mem or --loc, then by default the uploaded dataset is stored in the DB directory inside the unzipped folder.
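
Once the server is up (it listens on port 3030 by default), you can sanity-check the endpoint with a SPARQL query over HTTP, e.g.:

curl "http://localhost:3030/ds/query?query=SELECT%20%2A%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%2010"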


You can specify a custom assembler using

fuseki-server --update /inf --desc=assembler.ttl

assembler.ttl

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix sdb: <http://jena.hpl.hp.com/2007/sdb#> .

[] rdf:type fuseki:Server ;

 fuseki:services (
 <#tdb>
 ) .

<#tdb>  rdf:type fuseki:Service ;
 fuseki:name              "tdb" ;             # http://host/inf
 fuseki:serviceQuery      "sparql" ;          # SPARQL query service
 fuseki:serviceUpdate     "update" ;
 fuseki:dataset           <#dataset2> ;       #select which set to
 .                                            #use

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

<#dataset2> rdf:type ja:RDFDataset ;
 ja:defaultGraph <#model2>;
 .        

<#model2> a ja:OntModel;
 ja:baseModel <#tdbGraph>;
 ja:ontModelSpec ja:OWL_MEM ;
 ja:content [ja:externalContent <file:////home/wn20full/wnfull.rdfs>]
 .


<#tdbGraph> rdf:type tdb:GraphTDB;
 tdb:location "DB";
 .

<#interpretationSchema> a ja:MemoryModel ;
    ja:content [
        ja:externalContent <file:////home/wn20full/wnfull.rdfs> ;

    ] .

Wednesday 2 October 2013

How to create a user in CentOS and give sudo access

To create a user in CentOS follow these steps

1. You must be logged in as root to add a new user

2. Issue the useradd command to create a locked account
useradd <username>

3. Issue the passwd command to set the password of the newly created user
passwd <username>
This will prompt you to enter the password for the newly created user.


4. To give the user sudo access you need to add the user to the wheel group.

To do this issue the command : visudo 

To enable the wheel group you must uncomment the line, i.e. change
#%wheel   ALL=(ALL)   ALL
 to
%wheel   ALL=(ALL)   ALL


5. Add the newly created user to the wheel group (-a appends, so existing group memberships are preserved):
usermod -aG wheel <username>

6. Finished - your user now has sudo access! Enjoy!

How to manually setup hbase for your cloudera cdh4 cluster in CentOS | RHEL | Linux

Follow these steps


1. Install hbase on all the machines

sudo yum install hbase

2.Install hbase-master and zookeeper-server on your master machine

sudo yum install zookeeper-server
sudo yum install hbase-master

zookeeper-server automatically installs the base zookeeper package as well.
HBase needs a running ZooKeeper in order to start.


3. Install hbase-regionserver and zookeeper on all your slave machines

sudo yum install zookeeper
sudo yum install hbase-regionserver

4.Modifying the HBase Configuration


To enable pseudo-distributed mode, you must first make some configuration changes. Open /etc/hbase/conf/hbase-site.xml in your editor of choice, and insert the following XML properties between the <configuration> and </configuration> tags. Be sure to replace myhost with the hostname of your HDFS NameNode (as specified by fs.default.name or fs.defaultFS in your hadoop/conf/core-site.xml file); you may also need to change the port number from the default (8020).
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://myhost:8020/hbase</value>
</property>

5.Configuring for Distributed Operation


After you have decided which machines will run each process, you can edit the configuration so that the nodes may locate each other. In order to do so, you should make sure that the configuration files are synchronized across the cluster. Cloudera strongly recommends the use of a configuration management system to synchronize the configuration files, though you can use a simpler solution such as rsync to get started quickly.
The only configuration change necessary to move from pseudo-distributed operation to fully-distributed operation is the addition of the ZooKeeper Quorum address in hbase-site.xml. Insert the following XML property to configure the nodes with the address of the node where the ZooKeeper quorum peer is running:

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>mymasternode</value>
</property>

6.Creating the /hbase Directory in HDFS


Before starting the HBase Master, you need to create the /hbase directory in HDFS. The HBase master runs as hbase:hbase so it does not have the required permissions to create a top level directory.
To create the /hbase directory in HDFS:
$ sudo -u hdfs hadoop fs -mkdir /hbase
$ sudo -u hdfs hadoop fs -chown hbase /hbase

7.Starting the ZooKeeper Server

  • To start ZooKeeper after a fresh install (initialize first, then start):
$ sudo service zookeeper-server init
$ sudo service zookeeper-server start


8.Starting the HBase Master

  • On Red Hat and SLES systems (using .rpm packages) you can now start the HBase Master by using the included service script:
$ sudo service hbase-master start

To start the Region Server:
$ sudo service hbase-regionserver start

9.Accessing HBase by using the HBase Shell

After you have started HBase, you can access the database by using the HBase Shell:
$ hbase shell
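
Inside the shell a few basic commands will confirm the cluster is healthy, e.g.:

status
list
create 't1', 'f1'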