Thursday, May 15, 2014

Regex - Convert Hive functions to Java - translate and regexp_replace - Stack Overflow


1) How do I convert the Hive expression below to Java MapReduce?


 translate(regexp_replace(colA,"(\\\\=)","\\\\equalto"),"\[\]\(\)\{\}\^\?\+\*\$","____________") 

In the regexp_replace, I'm replacing every escaped equals sign (\=). In the outer translate, I'm replacing all characters that would affect later regexp_replace parsing. (If I don't replace these characters, they raise an exception later.)


2) Do I have to use replaceChars()? If so, how?


Sample string format is:


tag1=573 tag2=ABC 0nuif6d Saturn 0i899 AA 0 (WORD) LOWER 0 (WORD2) HH 0 BB 0 CC 1 LL 0 D 0 FF 0 AB 0 UPPER 0 (ONCOLD) UPPER 1 PART: Sold \= 88vb JJ number\= 0 String "String_here" ANDND JUJFNG fill EXTRA SUNSET: empty tag3=/Informational tag4=/Value tag5=Value1/Value2 tag6=/AB/Acs Sy/Api Afg Hold Cones/HHH+11: 4.3.2-4.3.4 tag6=11123 tag7=Hello World tag8=a-dfdAds\=\= tag9=Value3 tag.9=Space separated words \= 88 , cold 87 Goal Run\=2, LOT OF SPACE SEPARATED GARBAGE WORDS tag.a=0( tag.b=02


Note: Tag names are not hardcoded. They can be any English words, such as serial_number or website.address. For example, in serial_no=hello world website.address=\SO.com=/question, the tags are serial_no and website.address.




Description


This expression will:



  • assume tag names contain no spaces

  • assume tag names are separated from their respective values by an = that has no whitespace on either side and is not preceded by a \

  • assume tag names are separated from the preceding string by whitespace, and likewise that each value is separated from the next tag name by whitespace

  • capture the tag name and value

  • avoid breaking the string on equals signs that are embedded in the value side of the string


(\S*?)(?<!\\)=(\S*.*?)(?=\S*(?<!\\)=|\Z)




You can then reassemble or further process the individual components of the string as you see fit.
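As one possible form of that further processing, the captured names and values could be collected into a map. This is only a minimal sketch: the class name TagMapSketch and the method parse are hypothetical, the values are trimmed because the lazy match can leave trailing whitespace before the next tag, and a repeated tag name (such as tag6 in the sample text) simply overwrites the earlier value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagMapSketch {
    // Same expression as above; DOTALL lets the value span line breaks.
    static final Pattern TAG = Pattern.compile(
        "(\\S*?)(?<!\\\\)=(\\S*.*?)(?=\\S*(?<!\\\\)=|\\Z)", Pattern.DOTALL);

    // Hypothetical helper: returns tag name -> value in order of appearance.
    static Map<String, String> parse(String line) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            // Group 1 is the tag name, group 2 the value (trimmed here).
            tags.put(m.group(1), m.group(2).trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        String line = "serial_no=hello world tag8=a-dfdAds\\=\\= tag.a=0(";
        for (Map.Entry<String, String> e : parse(line).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

If repeated tags need to be kept, the map value could be a list instead of a single string.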


Example




Sample Text


This is the sample text you included in the comments. It's not really clear what defines a tag, or which equals sign separates the name from the value:


serial_no=hello world website.address=\SO.com=/question tag1=573 tag2=ABC 0nuif6d Saturn 0i899 AA 0 (WORD) LOWER 0 (WORD2) HH 0 BB 0 CC 1 LL 0 D 0 FF 0 AB 0 UPPER 0 (ONCOLD) UPPER 1 PART: Sold \= 88vb JJ number\= 0 String "String_here" ANDND JUJFNG fill EXTRA SUNSET: empty tag3=/Informational tag4=/Value tag5=Value1/Value2 tag6=/AB/Acs Sy/Api Afg Hold Cones/HHH+11: 4.3.2-4.3.4 tag6=11123 tag7=Hello World tag8=a-dfdAds\=\= tag9=Value3 tag.9=Space separated words \= 88 , cold 87 Goal Run\=2, LOT OF SPACE SEPARATED GARBAGE WORDS tag.a=0( tag.b=02


Sample Code


import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
    public static void main(String[] args) {
        // Placeholder input: substitute the sample text above.
        String sourcestring = "source string to match with pattern";
        // Group 1 = tag name, group 2 = value; (?<!\\) skips escaped equals signs.
        Pattern re = Pattern.compile(
            "(\\S*?)(?<!\\\\)=(\\S*.*?)(?=\\S*(?<!\\\\)=|\\Z)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            // Print the whole match (group 0) followed by each capture group.
            for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}

Matches


Group 0 will have the entire substring.
Group 1 will have the name field.
Group 2 will have the value field.


[0][0] = serial_no=hello world 
[0][1] = serial_no
[0][2] = hello world

[1][0] = website.address=\SO.com=/question
[1][1] = website.address
[1][2] = \SO.com=/question

[2][0] = tag1=573
[2][1] = tag1
[2][2] = 573

[3][0] = tag2=ABC 0nuif6d Saturn 0i899 AA 0 (WORD) LOWER 0 (WORD2) HH 0 BB 0 CC 1 LL 0 D 0 FF 0 AB 0 UPPER 0 (ONCOLD) UPPER 1 PART: Sold \= 88vb JJ number\= 0 String "String_here" ANDND JUJFNG fill EXTRA SUNSET: empty
[3][1] = tag2
[3][2] = ABC 0nuif6d Saturn 0i899 AA 0 (WORD) LOWER 0 (WORD2) HH 0 BB 0 CC 1 LL 0 D 0 FF 0 AB 0 UPPER 0 (ONCOLD) UPPER 1 PART: Sold \= 88vb JJ number\= 0 String "String_here" ANDND JUJFNG fill EXTRA SUNSET: empty

[4][0] = tag3=/Informational
[4][1] = tag3
[4][2] = /Informational

[5][0] = tag4=/Value
[5][1] = tag4
[5][2] = /Value

[6][0] = tag5=Value1/Value2
[6][1] = tag5
[6][2] = Value1/Value2

[7][0] = tag6=/AB/Acs Sy/Api Afg Hold Cones/HHH+11: 4.3.2-4.3.4
[7][1] = tag6
[7][2] = /AB/Acs Sy/Api Afg Hold Cones/HHH+11: 4.3.2-4.3.4

[8][0] = tag6=11123
[8][1] = tag6
[8][2] = 11123

[9][0] = tag7=Hello World
[9][1] = tag7
[9][2] = Hello World

[10][0] = tag8=a-dfdAds\=\=
[10][1] = tag8
[10][2] = a-dfdAds\=\=

[11][0] = tag9=Value3
[11][1] = tag9
[11][2] = Value3

[12][0] = tag.9=Space separated words \= 88 , cold 87 Goal Run\=2, LOT OF SPACE SEPARATED GARBAGE WORDS
[12][1] = tag.9
[12][2] = Space separated words \= 88 , cold 87 Goal Run\=2, LOT OF SPACE SEPARATED GARBAGE WORDS

[13][0] = tag.a=0(
[13][1] = tag.a
[13][2] = 0(

[14][0] = tag.b=02
[14][1] = tag.b
[14][2] = 02
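
For the original Hive expression itself, a rough Java equivalent might look like the sketch below. It rests on assumptions: that the Hive pattern "(\\\\=)" is meant to match a literal backslash followed by =, that the replacement is a literal backslash followed by equalto, and that the translate call maps each of [ ] ( ) { } ^ ? + * $ to an underscore. The class name HiveTranslateSketch and the method sanitize are hypothetical. Apache Commons Lang's StringUtils.replaceChars(str, searchChars, replaceChars) could perform the translate step, but it is not required; the loop below uses only the standard library.

```java
public class HiveTranslateSketch {
    // Characters replaced with '_' (assumed reading of the Hive string literal).
    static final String SPECIAL = "[](){}^?+*$";

    // Hypothetical helper mirroring translate(regexp_replace(colA, ...), ...).
    static String sanitize(String colA) {
        // String.replace is a literal (non-regex) replacement:
        // backslash-equals becomes backslash followed by "equalto".
        String replaced = colA.replace("\\=", "\\equalto");
        StringBuilder out = new StringBuilder(replaced.length());
        for (int i = 0; i < replaced.length(); i++) {
            char c = replaced.charAt(i);
            // Swap each regex metacharacter for an underscore, keep the rest.
            out.append(SPECIAL.indexOf(c) >= 0 ? '_' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: tag8=a\equaltob tag.a=0_ x_1
        System.out.println(sanitize("tag8=a\\=b tag.a=0( x+1"));
    }
}
```

In a MapReduce job, a helper like this could be called on each input line inside the mapper before any further regex processing.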


