
Read File into HDFS

Thread ID: 132238
Created: Aug 22, 2017 07:06 AM
Updated: Aug 28, 2017 05:55 AM
Platform: Data Integration Platform
Replies: 9
Tags: Data Integration
Tim Maher
Asked On August 22, 2017 07:06 AM

Hi, I am running both Big Data Studio and Data Integration Platform on the same virtual machine.
I would like to create a DataFlow task that reads an XLS file from C:\TEMP_FTP and inserts it into HDFS on my localhost. Is there any documentation on how to achieve this?
Many thanks!

Karthikeyan SankaraVadivel [Syncfusion]
Replied On August 22, 2017 11:52 PM

Hi Tim, 

Thank you for contacting Syncfusion Support.  

In Data Integration Platform, you can read the files using the GetFile processor and insert them into HDFS using the PutHDFS processor.
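As a rough illustration of what those two processors do together (GetFile picks files up from a local folder; PutHDFS lands them in HDFS), here is a purely hypothetical Python sketch. It uses a second local directory as a stand-in for HDFS, since the real transfer is configured in the DataFlow designer rather than in code:

```python
import shutil
from pathlib import Path

def move_xls_files(source_dir: str, target_dir: str) -> list:
    """Pick up .xls files from source_dir (like GetFile) and land
    copies in target_dir (a local stand-in for the PutHDFS step)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    landed = []
    for xls in Path(source_dir).glob("*.xls"):
        shutil.copy2(xls, target / xls.name)  # copy file with metadata
        landed.append(xls.name)
    return sorted(landed)
```

In the actual workflow, the GetFile processor's Input Directory property would point at C:\TEMP_FTP and the PutHDFS processor's Directory property at the target HDFS path.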

We have prepared a workflow that meets your requirement; please find the link below, which contains the dataflow template and documentation for the dataflow.


Please let us know if you need further assistance on this.  

Regards, 
Karthikeyan S  


Tim Maher
Replied On August 23, 2017 05:31 AM

Hello Karthikeyan,

This worked perfectly - many thanks!

If possible, I would like to add two more steps onto this.

Firstly, in the table I have imported into HDFS, which is called Shifts, I have two columns called Shift Type and Total Cost.  Could I create a second table called 'Measures' by taking these two columns and building a table with a corresponding Shift Type column and four further columns containing the Average, Median, Minimum and Maximum of Total Cost?

When this has been done, I would then like to copy the Shifts and Measures tables into Hive on the same server, which I can then read into my analytics platform, Qliksense.


I hope this is possible,

Best Wishes,

Tim


Karthikeyan SankaraVadivel [Syncfusion]
Replied On August 24, 2017 07:46 AM

Hi Tim, 

Yes, it is possible to create a Hive table 'Measures' from another table 'Shifts'; using Hive's built-in functions, we can calculate the Mean, Median, Minimum, and Maximum of the Total Cost column.
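For reference, here is a minimal Python sketch of what those aggregates compute per Shift Type, mirroring Hive's AVG, MIN, MAX, and median (percentile) functions. This is only an illustration of the arithmetic; the actual computation happens in the Hive query:

```python
from collections import defaultdict
from statistics import mean, median

def shift_measures(rows):
    """Compute Average, Median, Minimum, and Maximum of Total Cost
    per Shift Type, as the Hive aggregate functions would."""
    by_type = defaultdict(list)
    for shift_type, total_cost in rows:
        by_type[shift_type].append(total_cost)
    return {
        t: {"avg": mean(c), "median": median(c), "min": min(c), "max": max(c)}
        for t, c in by_type.items()
    }
```

Each row of the resulting Measures table corresponds to one Shift Type key with its four computed statistics.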

In Data Integration Platform, the SelectHiveQL processor executes a HiveQL SELECT query against a Hive database. The query result is converted to Avro or CSV format, and the result is then inserted into the "Measures" table using the PutHiveQL processor.

We have prepared a sample workflow; please find the link below, which contains the dataflow template and documentation for the dataflow.


Please let me know if you need further assistance on this.  

Regards, 
Karthikeyan S  


Tim Maher
Replied On August 24, 2017 08:25 AM

Hi Karthikeyan,

Thanks for this solution.  Unfortunately, it is failing at the first step with an error:

.....Cannot create PoolableConnectionFactory (could not establish connection to jdbc:hive2://localhost:10000/default;auth=noSasl;:null). No FlowFile to route to failure:....

Can you assist please?

Many many thanks,


Tim


Karthikeyan SankaraVadivel [Syncfusion]
Replied On August 24, 2017 08:53 AM

Hi Tim, 

We suspect that the Hive server has stopped on your machine. Please make sure that your Hive server is in a running state. You can see the state of the Hadoop ecosystem services in the Big Data Studio Local Service Manager.

 

If not, please restart your services from the Service Manager.

Once you have restarted the Hadoop ecosystem services, please reinitiate your Data Integration workflow.

Please let me know if you need further assistance.

Regards, 
Karthikeyan S 


Tim Maher
Replied On August 24, 2017 09:42 AM

Hi,

Thanks for your quick reply.  The services were running correctly. 

[screenshot attached]



I have just remembered that, as part of another support thread, I was advised to update the authentication parameters within the hive-site.xml file, since I had to connect to the Hive data from a separate analytics platform. Would this have an effect? The parameters are now:


<property>
    <name>hive.server2.authentication.syncfusion.username</name>
    <value>tim</value>
</property>

<property>
    <name>hive.server2.authentication.syncfusion.password</name>
    <value>password</value>
</property>

Regards,

Tim

Nandhini K [Syncfusion]
Replied On August 25, 2017 06:07 AM

Hi Tim, 

Yes. You then need to provide credentials in the database connection URL. 
In the provided document you may find a section to set “Database Connection URL” for Hive like below, 
   
 

There you can set the authentication parameters as follows:

Database Connection URL: jdbc:hive2://localhost:10000/default;
Database User: tim
Password: password

If the above configuration doesn’t work, please set the authentication parameters in the connection URL itself, as below:

Database Connection URL: jdbc:hive2://localhost:10000/default; -n tim -p password

Please let us know if you have any queries. 

Regards, 
Nandhini K 


Tim Maher
Replied On August 25, 2017 12:32 PM

Hi,


Thanks for your reply.  This is now working, although it seems to fail on the last step:

HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:62 cannot recognize input near ','',' '"Total_Cost"' in value row constructor; it is possible that retrying the operation will succeed, so routing to retry: org.apache....................

I also have another question.  In the 'Insert in to Hive Table' step, does the table being inserted into have to already exist? Or will it append to an existing table, overwrite an existing table, or create the table if it doesn't exist?


Thanks you for your help


Karthikeyan SankaraVadivel [Syncfusion]
Replied On August 28, 2017 05:55 AM

Hi Tim, 

HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:62 cannot recognize input near ','',' '"Total_Cost"' in value row constructor; it is possible that retrying the operation will succeed, so routing to retry: org.apache.................... 
Make sure that your INSERT query is properly constructed according to your Measures table. Use single quotes (' ') for STRING column types in Hive.

For example:

Replacement Value in the "Constructing queries" processor: insert into Measures values('${csv.2:getDelimitedField(1)}',${csv.2:getDelimitedField(2)},${csv.2:getDelimitedField(3)},${csv.2:getDelimitedField(4)},${csv.2:getDelimitedField(5)})

Query generated: insert into Measures values('"Type4"',600.0,600.0,600.0,600.0)
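The quoting rule behind that example can be sketched as a small helper (hypothetical, not part of the platform): string values are wrapped in single quotes, while numeric values are left bare, producing a statement shaped like the one above:

```python
def build_insert(table, values):
    """Build a Hive INSERT statement, single-quoting string values
    (as Hive requires for STRING columns) and leaving numbers bare."""
    rendered = [
        f"'{v}'" if isinstance(v, str) else str(v)  # quote only strings
        for v in values
    ]
    return f"insert into {table} values({','.join(rendered)})"
```

In the real workflow this quoting is done by the replacement-value expression itself, by placing the single quotes around the first `getDelimitedField` reference only.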

You can view the generated query from the "Constructing queries" processor using Data Provenance.
I also have another question.  In the 'Insert in to Hive Table' step,  does the table that it is inserting into have to already exist ? or will it append to an existing table, overwrite an existing table or create a table is it doesn't exist? 
The "Insert in to Hive Table" processor just inserts the records into the Hive table. If the Measures table is not present on the Hive server, you need to create it manually.

If the table already exists, the processor will not overwrite or recreate it; it will simply append the records to the existing table.

Please let me know if you need further assistance.  

Regards, 
Karthikeyan S  

