Monday 30 December 2013

org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException: A job instance already exists and is complete for parameters


Spring Batch requires a unique set of job parameters for every execution of a job, so you can add the current time as a job parameter to make each run unique:

// JobParameter and JobParameters live in org.springframework.batch.core
Map<String, JobParameter> confMap = new HashMap<String, JobParameter>();
confMap.put("time", new JobParameter(System.currentTimeMillis())); // differs on every run
JobParameters jobParameters = new JobParameters(confMap);
jobLauncher.run(springCoreJob, jobParameters);
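
Alternatively, Spring Batch ships a RunIdIncrementer (org.springframework.batch.core.launch.support.RunIdIncrementer) that can be attached to the job definition, so each launch automatically gets a fresh run.id parameter.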

Friday 20 December 2013

Sort mapreduce output keys in descending order

Add the following comparator as a nested class in your job class:

public static class ReverseComparator extends WritableComparator {
    
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    public ReverseComparator() {
        super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
       return (-1)* TEXT_COMPARATOR.compare(b1, s1, l1, b2, s2, l2);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        if (a instanceof Text && b instanceof Text) {
                return (-1)*(((Text) a).compareTo((Text) b));
        }
        return super.compare(a, b);
    }
}
In the new API (org.apache.hadoop.mapreduce), register it on your Job (see the driver sketch below):
job.setSortComparatorClass(ReverseComparator.class);

NB. This only works if your key is of type Text; otherwise modify the reverse comparator class accordingly.
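
For context, here is a minimal driver sketch showing where the comparator is registered; MyMapper, MyReducer and the output value type are hypothetical placeholders for your own job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDescendingDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sort-desc");
        job.setJarByClass(SortDescendingDriver.class);
        job.setMapperClass(MyMapper.class);       // hypothetical mapper
        job.setReducerClass(MyReducer.class);     // hypothetical reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setSortComparatorClass(ReverseComparator.class); // keys now sort descending
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}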

Set separator for mapreduce output

By default the key/value output separator is a tab character. To have the output separated by a character of your choice, set this configuration property:
conf.set("mapred.textoutputformat.separator", ",");
where conf is an org.apache.hadoop.conf.Configuration object. The mapreduce output (i.e. the key and value) will be comma separated in this case.
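
In newer Hadoop releases (the org.apache.hadoop.mapreduce API) the equivalent key is mapreduce.output.textoutputformat.separator; the old mapred.* name should still resolve through Hadoop's deprecated-property mapping. A minimal sketch of setting it before submitting the job:

Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ","); // newer property name
Job job = new Job(conf, "csv-output");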


Tuesday 10 December 2013

Region servers going down in cdh4 due to mapreduce job

I faced this problem because I had set the scan caching to 500, i.e. each RPC passed 500 rows to the mapreduce job, which is memory intensive and not recommended; see the sketch below.
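
A minimal sketch of dialling the caching back when wiring the scan into a mapreduce job; it assumes an existing Job instance, and the table name and MyTableMapper are hypothetical:

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

Scan scan = new Scan();
scan.setCaching(100);        // rows fetched per RPC; far gentler than 500
scan.setCacheBlocks(false);  // recommended for MR jobs so the block cache isn't churned
TableMapReduceUtil.initTableMapperJob("myTable", scan,
        MyTableMapper.class,                       // hypothetical TableMapper
        ImmutableBytesWritable.class, Result.class, job);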

Data driven db input format

Include the id column also.....

In the case of DBInputFormat, don't use the id in the VO.

Thursday 21 November 2013

Password of cloudera-scm user

It is present in the file

/var/lib/cloudera-scm-server-db/data/generated_password.txt

Wednesday 20 November 2013

Enable the logging feature for a SOAP web service to see the test case response

In the WEB-INF folder, inside service.xml:

<jaxws:endpoint...>  
   <jaxws:features>  
      <bean class="org.apache.cxf.feature.LoggingFeature"/>
   </jaxws:features>  
</jaxws:endpoint>

Thursday 14 November 2013

Find count of each word in a file in linux command line

tr -s '[:space:]' '\n' < your_filename | sort | uniq -c | sort -rn | head -n 50

This shows the 50 most frequent words in your file, sorted by count.

Tuesday 5 November 2013

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/Scan

Setting the HADOOP_CLASSPATH variable will fix this issue:
 
export HADOOP_CLASSPATH=`/usr/bin/hbase classpath`
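
After exporting it, launch your job as usual; for example (the jar and driver class names here are hypothetical):

hadoop jar myjob.jar com.example.MyDriver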

Thursday 31 October 2013

To install R studio in centos 6

CentOS does not support RStudio Desktop, so you have to install RStudio Server instead, which runs in a browser.


For EL6


 
RStudio Server has several dependencies on packages (including R itself) found in the Extra Packages for Enterprise Linux (EPEL) repository. If you don't already have this repository available you should add it to your system using the instructions found on the Fedora EPEL website.

$ su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm'


After enabling EPEL you should then ensure that you have installed the version of R available from EPEL. You can do this using the following command:
 
$ sudo yum install R
 
 

Download and Install

To download and install RStudio Server open a terminal window and execute the commands corresponding to the 32 or 64-bit version as appropriate.

32-bit Size: 17.5 MB MD5: 3bc83db8c23c212c391342e731f65823

$ wget http://download2.rstudio.org/rstudio-server-0.97.551-i686.rpm
$ sudo yum install --nogpgcheck rstudio-server-0.97.551-i686.rpm

64-bit Size: 17.6 MB MD5: c89d5574a587472d06f72b304356a776

$ wget http://download2.rstudio.org/rstudio-server-0.97.551-x86_64.rpm
$ sudo yum install --nogpgcheck rstudio-server-0.97.551-x86_64.rpm


Then in your browser go to address

http://<your_server_name>:8787

The login credentials are the username and password of your CentOS account.
 


Reference: http://www.rstudio.com/ide/download/server

Wednesday 30 October 2013

HBase

list - lists the tables

If needed, add

<property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2182</value>
</property>

to hbase-site.xml, change clientPort to 2182 in zoo.cfg, and in hbase-env.sh uncomment

export HBASE_MANAGES_ZK=true

try restarting the processes



Run this from the terminal to count the number of distinct rows in an HBase table:

echo "scan 'table'" | hbase shell | grep "columnFamliyName:" | wc -l

This counts one line per cell, so divide the count by the number of columns per row and you get the number of distinct rows.
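
Alternatively, the HBase shell has a built-in row counter, and HBase also ships a mapreduce-based one that scales better for large tables:

echo "count 'table'" | hbase shell
hbase org.apache.hadoop.hbase.mapreduce.RowCounter table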

Grep and replace linux

Finding and replacing using grep in linux / centos 6+ 

1. The syntax is 
 grep -rl 'findWord' path_to_your_file_name | xargs sed -i 's/findWord/replaceWord/g'

2. Searching for all files in the current directory for the term windows and replace it with linux
grep -rl 'windows' ./ | xargs sed -i 's/windows/linux/g'
s/ - starts the sed substitute command
/g - the global flag; replaces every occurrence on each line, not just the first
./ - the current directory

3. You can use regex also - eg.
grep -rl '{:\".*\",\"hello\":\"' data.txt | xargs sed -i 's/{:\".*\",\"hello\":\"//g'

Thursday 17 October 2013

Point hbase conf folder to another folder using alternatives(a linux command) in CentOS

$ alternatives --display hbase-conf
hbase-conf - status is auto.
 link currently points to /etc/hbase/conf.dist
/etc/hbase/conf.dist - priority 30
Current `best' version is /etc/hbase/conf.dist.

$ alternatives --install /etc/hbase/conf hbase-conf /etc/hbase/conf.my_cluster/ 10
failed to create /var/lib/alternatives/hbase-conf.new: Permission denied

$ sudo alternatives --install /etc/hbase/conf hbase-conf /etc/hbase/conf.my_cluster/ 10

$ sudo alternatives --display hbase-conf
hbase-conf - status is auto.
 link currently points to /etc/hbase/conf.dist
/etc/hbase/conf.dist - priority 30
/etc/hbase/conf.my_cluster/ - priority 10
Current `best' version is /etc/hbase/conf.dist.

$ sudo alternatives --set hbase-conf /etc/hbase/conf.my_cluster/

$ sudo alternatives --display hbase-conf
hbase-conf - status is manual.
 link currently points to /etc/hbase/conf.my_cluster/
/etc/hbase/conf.dist - priority 30
/etc/hbase/conf.my_cluster/ - priority 10
Current `best' version is /etc/hbase/conf.dist.

Owlim lite installation Centos

Follow the instructions on this link 

http://owlim.ontotext.com/display/OWLIMv52/OWLIM-Lite+Installation

If "access denied" is shown at times, change the owner of that folder to tomcat:

sudo chown tomcat:tomcat your_folder

If you have more than 64,000 entities to be stored in OWLIM, then to support them edit /etc/tomcat6/tomcat6.conf and set:

CATALINA_OPTS="-DentityExpansionLimit=1000000"   # or your desired number

Restart tomcat and everything should be fine:
 sudo service tomcat6 restart

Thursday 10 October 2013

RDFParseException in Eclipse when using large RDF files

org.openrdf.rio.RDFParseException: The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.


This error occurs when running SPARQL queries because the parser, by default, does not allow more than 64,000 entity expansions.

Open Window -> Preferences -> Java -> Installed JREs, edit the installed JRE, and add the following to the default VM arguments (the arguments must be separated by a space):

-Xmx1024m -DentityExpansionLimit=100000
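
The same system property applies to any JVM launch, not just Eclipse; for example (the jar name is hypothetical):

java -Xmx1024m -DentityExpansionLimit=100000 -jar yourapp.jar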


Wednesday 9 October 2013

Jena Fuseki server configurations

Download and unzip the Jena Fuseki server.

To start the server with the dataset held in memory:

fuseki-server --update --mem /ds

To start the server with the dataset persisted in a user-specified directory:

fuseki-server --update --loc=your_path_to_directory /ds


If you don't specify --mem or --loc, then by default the uploaded dataset is stored in the DB directory inside the unzipped folder.
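
Once the server is up (it listens on port 3030 by default), you can sanity-check the endpoint with a SPARQL query over HTTP, e.g.:

curl "http://localhost:3030/ds/query?query=SELECT%20%2A%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%2010"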


You can specify a custom assembler using

fuseki-server --update /inf --desc=assembler.ttl

assembler.ttl

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix sdb: <http://jena.hpl.hp.com/2007/sdb#> .

[] rdf:type fuseki:Server ;

 fuseki:services (
 <#tdb>
 ) .

<#tdb>  rdf:type fuseki:Service ;
 fuseki:name              "tdb" ;             # http://host/inf
 fuseki:serviceQuery      "sparql" ;          # SPARQL query service
 fuseki:serviceUpdate     "update" ;
 fuseki:dataset           <#dataset2> ;       #select which set to
 .                                            #use

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

<#dataset2> rdf:type ja:RDFDataset ;
 ja:defaultGraph <#model2>;
 .        

<#model2> a ja:OntModel;
 ja:baseModel <#tdbGraph>;
 ja:ontModelSpec ja:OWL_MEM ;
 ja:content [ja:externalContent <file:////home/wn20full/wnfull.rdfs>]
 .


<#tdbGraph> rdf:type tdb:GraphTDB;
 tdb:location "DB";
 .

<#interpretationSchema> a ja:MemoryModel ;
    ja:content [
        ja:externalContent <file:////home/wn20full/wnfull.rdfs> ;

    ] .

Wednesday 2 October 2013

How to create a user in CentOS and give sudo access

To create a user in CentOS follow these steps

1. You must be logged in as root to add a new user

2. Issue the useradd command to create a locked account
useradd <username>

3. Issue the passwd command to set the password of the newly created user
passwd <username>
This will prompt you to enter the password for the newly created user.


4. To give the user sudo access you need to add the user to the wheel group.

To do this issue the command : visudo 

To enable the wheel group you must uncomment the line, i.e. change
#%wheel   ALL=(ALL)   ALL
 to
%wheel   ALL=(ALL)   ALL


5. Add the newly created user to the wheel group (-a appends, so existing group memberships are preserved):
usermod -aG wheel <username>

6. Finished - your user now has sudo access! Enjoy!

How to manually setup hbase for your cloudera cdh4 cluster in CentOS | RHEL | Linux

Follow these steps


1. Install hbase on all the machines

sudo yum install hbase

2.Install hbase-master and zookeeper-server on your master machine

sudo yum install zookeeper-server
sudo yum install hbase-master

zookeeper-server automatically installs the base zookeeper package as well.
HBase needs a running ZooKeeper in order to start.


3. Install hbase-regionserver and zookeeper on all your slave machines

sudo yum install zookeeper
sudo yum install hbase-regionserver

4.Modifying the HBase Configuration


To enable pseudo-distributed mode, you must first make some configuration changes. Open /etc/hbase/conf/hbase-site.xml in your editor of choice, and insert the following XML properties between the <configuration> and </configuration> tags. Be sure to replace myhost with the hostname of your HDFS NameNode (as specified by fs.default.name or fs.defaultFS in your hadoop/conf/core-site.xml file); you may also need to change the port number from the default (8020).
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://myhost:8020/hbase</value>
</property>

5.Configuring for Distributed Operation


After you have decided which machines will run each process, you can edit the configuration so that the nodes may locate each other. In order to do so, you should make sure that the configuration files are synchronized across the cluster. Cloudera strongly recommends the use of a configuration management system to synchronize the configuration files, though you can use a simpler solution such as rsync to get started quickly.
The only configuration change necessary to move from pseudo-distributed operation to fully-distributed operation is the addition of the ZooKeeper Quorum address in hbase-site.xml. Insert the following XML property to configure the nodes with the address of the node where the ZooKeeper quorum peer is running:

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>mymasternode</value>
</property>

6.Creating the /hbase Directory in HDFS


Before starting the HBase Master, you need to create the /hbase directory in HDFS. The HBase master runs as hbase:hbase so it does not have the required permissions to create a top level directory.
To create the /hbase directory in HDFS:
$ sudo -u hdfs hadoop fs -mkdir /hbase
$ sudo -u hdfs hadoop fs -chown hbase /hbase

7.Starting the ZooKeeper Server

  • To start ZooKeeper after a fresh install (initialize first, then start):
$ sudo service zookeeper-server init
$ sudo service zookeeper-server start


8.Starting the HBase Master

  • On Red Hat and SLES systems (using .rpm packages) you can now start the HBase Master by using the included service script:
$ sudo service hbase-master start

To start the Region Server:
$ sudo service hbase-regionserver start

9.Accessing HBase by using the HBase Shell

After you have started HBase, you can access the database by using the HBase Shell:
$ hbase shell
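
Inside the shell a few basic commands will confirm the cluster is healthy, e.g.:

status
list
create 't1', 'f1'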