Thursday, 22 February 2018

Using pytest to check ssh to multiple servers

Use case: check SSH connectivity to multiple clusters in one test


#!/usr/bin/python
import paramiko
import pytest

# one fixture instance is created per IP in params, so every test using it runs once per server
@pytest.fixture(params=["ip1.ip1.ip1.ip1", "ip2.ip2.ip2.ip2"])
def ssh(request):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(request.param, username="username", key_filename="connect.pem")
    yield ssh
    ssh.close()

def test_hello(ssh):
    stdin, stdout, stderr = ssh.exec_command("echo hello")
    stdin.close()
    assert stderr.read() == b""
    assert stdout.read() == b"hello\n"
Run pytest in the terminal to run this test.

To check the plan of the pytest run:
pytest --collect-only
A fixture sets up the environment before the test cases execute. Since there are multiple parameters here, you will see more than one call to the test method in the test plan.
To use a fixture in a test, just pass the fixture name as a function argument.

> pytest --collect-only
collected 2 items 
<Module 'Utility_test.py'>
  <Function 'test_hello[10.30.107.85]'>
  <Function 'test_hello[10.20.91.148]'>
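
Any additional test function that takes the ssh fixture is also collected once per parameter. For example, a second check like the hypothetical sketch below would bring the collected count to 4:

def test_whoami(ssh):
    # hypothetical extra test reusing the same parametrized ssh fixture;
    # pytest collects it once per IP listed in the fixture params
    stdin, stdout, stderr = ssh.exec_command("whoami")
    stdin.close()
    assert stdout.read().strip() != b""  # some username came back from every host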

Unable to install R packages rgl & qpcR in sparkR

If you encounter the following error while installing the rgl package in SparkR:

configure: using libpng dynamic linkage
checking for X... no
configure: error: X11 not found but required, configure aborted.

it's because the X11 and OpenGL development libraries that rgl compiles against are missing. To resolve, install them:
Ubuntu : 

sudo apt-get install libglu1-mesa-dev


Red Hat:

sudo yum install mesa-libGL-devel mesa-libGLU-devel libpng-devel

Tuesday, 18 July 2017

Sqoop views in Netezza to HDFS

I came across a use case to transfer Netezza tables/views to a Hadoop system. The current flow that we are using is:
1. Netezza -> SAN
2. SAN -> S3
3. S3 -> hdfs

And the reverse to transfer back to Netezza. After analyzing the use case, the best option I found was Sqoop.

If there is no primary key for the table in Netezza, you will be forced to use the --split-by option or the -m option. Only use --verbose if needed. We are using YARN queues, hence the queue option; you can ignore it if no queue is set up.

1. Transfer view
# Sqoop does not allow you to write into an existing directory, so remove the directory before transferring
hdfs dfs -rm -R /apps/hive/warehouse/<hivedbname>.db/<hivetablename>

sqoop import -Dmapreduce.job.queuename=q1 --hive-import --hive-database <hivedbname> --hive-table <hivetablename> --driver org.netezza.Driver --direct --connect jdbc:netezza://<host>:<port>/<netezzadbname> --username <netezzauser> --password <netezzapwd> --table <netezza tablename> --target-dir hdfs:///apps/hive/warehouse/<hivedbname>.db/<hivetablename> --split-by <anycolumn>

If we don't use the --driver org.netezza.Driver parameter, the following error is encountered.

2017-07-18 09:34:53,079 ERROR [Thread-16] org.apache.sqoop.mapreduce.db.netezza.NetezzaJDBCStatementRunner: Unable to execute external table export
org.netezza.error.NzSQLException: ERROR:  Column reference "DATASLICEID" not supported for views

at org.netezza.internal.QueryExecutor.getNextResult(QueryExecutor.java:276)
at org.netezza.internal.QueryExecutor.execute(QueryExecutor.java:73)
at org.netezza.sql.NzConnection.execute(NzConnection.java:2673)
at org.netezza.sql.NzStatement._execute(NzStatement.java:849)
at org.netezza.sql.NzPreparedStatament.execute(NzPreparedStatament.java:152)
at org.apache.sqoop.mapreduce.db.netezza.NetezzaJDBCStatementRunner.run(NetezzaJDBCStatementRunner.java:75)

End of LogType:syslog


Instead of the --split-by option we can also use -m 1, which transfers the data with a single mapper and can be a bit slow.

2. Transfer a table
# Sqoop does not allow you to write into an existing directory, so remove the directory before transferring
hdfs dfs -rm -R /apps/hive/warehouse/<hivedbname>.db/<hivetablename>

sqoop import -Dmapreduce.job.queuename=q1 --verbose --hive-import --hive-database <hivedbname> --direct --connect jdbc:netezza://<host>:<port>/<netezzadbname> --username <netezzauser> --password <netezzapwd> --table <netezza tablename> --target-dir hdfs:///apps/hive/warehouse/<hivedbname>.db/<hivetablename> -m 1


Running
analyze table <hivedbname>.<hivetablename> compute statistics
would be ideal for Hive running on the Tez execution engine.
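
If several tables or views need to be moved, the same two-step pattern (drop the target directory, then sqoop import) can be scripted. Below is a rough sketch in Python using subprocess; the host, port, credentials, queue and object names are placeholders, not from a real setup:

import subprocess

TABLES = ["table_a", "table_b"]   # hypothetical Netezza objects
HIVE_DB = "hivedb"                # hypothetical Hive database

for tbl in TABLES:
    target = "/apps/hive/warehouse/{0}.db/{1}".format(HIVE_DB, tbl.lower())
    # Sqoop cannot write into an existing directory, so drop it first (it may not exist yet)
    subprocess.run(["hdfs", "dfs", "-rm", "-R", target], check=False)
    # add --driver org.netezza.Driver when importing a view (see the error above)
    subprocess.run([
        "sqoop", "import", "-Dmapreduce.job.queuename=q1",
        "--hive-import", "--hive-database", HIVE_DB, "--hive-table", tbl.lower(),
        "--direct",
        "--connect", "jdbc:netezza://netezza-host:5480/netezzadb",  # placeholder host/db
        "--username", "netezzauser", "--password", "netezzapwd",    # placeholder credentials
        "--table", tbl,
        "--target-dir", "hdfs://" + target,
        "-m", "1",
    ], check=True)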

Wednesday, 28 June 2017

SBT stuck at getting sbt

Run
sbt -Dsbt.repository.secure=false
so that it fetches over plain HTTP instead of the default HTTPS repositories, which helps if you are behind a firewall or proxy that blocks them.

Wednesday, 10 June 2015

Allow Rserve to be accessed from a remote machine

By default, Rserve only accepts connections from localhost, on port 6311.

This can be changed by creating a configuration file and passing it as an argument to Rserve.
The configuration file is not present by default, so we will have to create one.
Create a file Rserv.cfg (you can use any name, it does not matter).

Inside Rserv.cfg:
remote enable
port 6566
plaintext disable
R command to start Rserve
Rserve(args="--RS-conf D:\fakepath\Rserv.cfg")
On Linux, replace the Windows absolute path with the corresponding Linux path. The default configuration file on Linux is /etc/Rserv.conf; if it does not exist, create one.

This will make Rserve pick up the settings specified in the cfg file.
For more parameters and command line arguments for Rserve, see the Rserve documentation.
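
Once Rserve is listening on the new port, a quick way to verify remote access from another machine is with a client library, for example the third-party pyRserve Python package (an assumption on my part, it is not mentioned above):

import pyRserve

# connect from the remote machine; port 6566 matches the Rserv.cfg above,
# and the hostname is a placeholder for the machine actually running Rserve
conn = pyRserve.connect(host="rserve-host.example.com", port=6566)
print(conn.eval("R.version.string"))  # evaluate a simple R expression remotely
conn.close()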

Monday, 25 May 2015

Create a Gmail river using the IMAP river plugin in Elasticsearch

curl -XPUT localhost:9200/_river/gmailriver/_meta -d '
{
   "type":"imap",
   "mail.store.protocol":"imap",
   "mail.imap.host":"imap.googlemail.com",
   "mail.imap.port":993,
   "mail.imap.ssl.enable":true,
   "mail.imap.connectionpoolsize":"3",
   "mail.debug":"true",
   "mail.imap.timeout":10000,
   "user":"xxxx@gmail.com",
   "password":"xxxx$",
   "schedule":null,
   "interval":"60s",
   "threads":5,
   "folderpattern":"^INBOX$",
   "bulk_size":100,
   "max_bulk_requests":"2",
   "bulk_flush_interval":"5s",
   "mail_index_name":"gmailriveridx",
   "mail_type_name":"mail",
   "with_striptags_from_textcontent":true,
   "with_attachments":false,
   "with_text_content":true,
   "with_flag_sync":true,
   "index_settings" : null,
   "type_mapping" : null
}"