Java Unicode string length

IT Share you

shareyou 2021. 1. 8. 21:46

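I am trying hard to get the character count of a Unicode string and have tried various options. It looks like a small problem, but it has me badly stuck.

Here I am trying to get the length of the string str1. I get 6, but it is actually 3. Moving the cursor across the string "குமார்" also shows it as 3 characters.

Basically, I want to measure the length and print each character, such as "கு", "மா", "ர்".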

public class one {
    public static void main(String[] args) {
        String str1 = new String("குமார்");
        System.out.print(str1.length());
    }
}

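P.S.: This is Tamil.

I found a solution to your problem.

Based on this SO answer, I wrote a program that uses regex character classes to find letters that may have optional modifiers. It splits the string into single (combined if necessary) characters and puts them into a list.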

import java.util.*;
import java.lang.*;
import java.util.regex.*;

class Main
{
    public static void main (String[] args)
    {
        String s="குமார்";
        List<String> characters=new ArrayList<String>();
        Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
        Matcher matcher = pat.matcher(s);
        while (matcher.find()) {
            characters.add(matcher.group());            
        }

        // Test if we have the right characters and length
        System.out.println(characters);
        System.out.println("String length: " + characters.size());

    }
}

Here \\p{L} means a Unicode letter and \\p{M} means a Unicode mark.

The output of the snippet is:

[கு, மா, ர்]
String length: 3

See https://ideone.com/Apkapn for a working demo.
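As a minimal sketch, the regex approach above could be wrapped in a reusable helper. The class name ClusterSplitter and the method split are hypothetical names chosen for illustration. The second input shows that the same pattern also groups a decomposed Latin letter with its accent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the original answer.
class ClusterSplitter {
    // One base letter followed by any number of combining marks.
    private static final Pattern CLUSTER = Pattern.compile("\\p{L}\\p{M}*");

    // Splits text into "letter plus marks" units; anything else (digits, spaces) is skipped.
    static List<String> split(String text) {
        List<String> units = new ArrayList<>();
        Matcher m = CLUSTER.matcher(text);
        while (m.find()) {
            units.add(m.group());
        }
        return units;
    }

    public static void main(String[] args) {
        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"; // "குமார்"
        String latin = "A\u0301b";                              // decomposed "Á" followed by "b"
        System.out.println(split(tamil).size()); // 3
        System.out.println(split(latin).size()); // 2
    }
}
```

Note that this pattern only matches units that start with a letter, so digits, whitespace and punctuation are not counted at all.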

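EDIT

I have now checked my regex against all valid Tamil letters taken from the tables at http://en.wikipedia.org/wiki/Tamil_script. I found that the current regex does not capture all letters correctly (every letter in the last row of the Grantha compound table is split into two letters), so I refined the regex to the following solution: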

Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");

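Using this pattern instead of the one above, you should be able to split your sentence into every valid Tamil letter (as long as Wikipedia's table is complete).

The code I used for the check is the following: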

String s = "ஃஅஆஇஈஉஊஎஏஐஒஓஔக்ககாகிகீகுகூகெகேகைகொகோகௌங்ஙஙாஙிஙீஙுஙூஙெஙேஙைஙொஙோஙௌச்சசாசிசீசுசூசெசேசைசொசோசௌஞ்ஞஞாஞிஞீஞுஞூஞெஞேஞைஞொஞோஞௌட்டடாடிடீடுடூடெடேடைடொடோடௌண்ணணாணிணீணுணூணெணேணைணொணோணௌத்ததாதிதீதுதூதெதேதைதொதோதௌந்நநாநிநீநுநூநெநேநைநொநோநௌப்பபாபிபீபுபூபெபேபைபொபோபௌம்மமாமிமீமுமூமெமேமைமொமோமௌய்யயாயியீயுயூயெயேயையொயோயௌர்ரராரிரீருரூரெரேரைரொரோரௌல்லலாலிலீலுலூலெலேலைலொலோலௌவ்வவாவிவீவுவூவெவேவைவொவோவௌழ்ழழாழிழீழுழூழெழேழைழொழோழௌள்ளளாளிளீளுளூளெளேளைளொளோளௌற்றறாறிறீறுறூறெறேறைறொறோறௌன்னனானினீனுனூனெனேனைனொனோனௌஶ்ஶஶாஶிஶீஶுஶூஶெஶேஶைஶொஶோஶௌஜ்ஜஜாஜிஜீஜுஜூஜெஜேஜைஜொஜோஜௌஷ்ஷஷாஷிஷீஷுஷூஷெஷேஷைஷொஷோஷௌஸ்ஸஸாஸிஸீஸுஸூஸெஸேஸைஸொஸோஸௌஹ்ஹஹாஹிஹீஹுஹூஹெஹேஹைஹொஹோஹௌக்ஷ்க்ஷக்ஷாக்ஷிக்ஷீக்ஷுக்ஷூக்ஷெக்ஷேக்ஷைஷொக்ஷோஷௌ";
List<String> characters = new ArrayList<String>();
Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");
Matcher matcher = pat.matcher(s);
while (matcher.find()) {
    characters.add(matcher.group());
}

System.out.println(characters);
System.out.println(characters.size() == 325);

Have a look at the Normalizer class. It explains what may be causing your problem. In Unicode, a character such as Á can be encoded in several ways:

  U+00C1    LATIN CAPITAL LETTER A WITH ACUTE

or

  U+0041    LATIN CAPITAL LETTER A
  U+0301    COMBINING ACUTE ACCENT

You can use Normalizer to convert the string to its composed form and then iterate over the characters.
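To make the Á example concrete, here is a small sketch using java.text.Normalizer. The class name NormalizerDemo and the helper nfc are hypothetical. It composes the two-code-point form into U+00C1 and also runs the Tamil string through the same call for comparison.

```java
import java.text.Normalizer;

// Hypothetical demo class; only Normalizer.normalize is a real JDK API here.
class NormalizerDemo {
    // Converts a string to Unicode Normalization Form C (canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0301";            // A + COMBINING ACUTE ACCENT
        String composed = nfc(decomposed);
        System.out.println(decomposed.length());  // 2
        System.out.println(composed.length());    // 1
        System.out.println(composed.equals("\u00C1")); // true

        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"; // "குமார்"
        // Tamil consonant + vowel-sign pairs have no precomposed code points,
        // so NFC leaves this string at 6 chars.
        System.out.println(nfc(tamil).length());  // 6
    }
}
```

Composition alone shortens the Latin example but not the Tamil one, so the marks still have to be grouped with their base letters in a further step.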

Edit: Based on the article suggested by @halex above, try this in Java:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.ArrayList;

    String str = "குமார்";

    ArrayList<String> characters = new ArrayList<String>();
    str = Normalizer.normalize(str, Form.NFC);
    StringBuilder charBuffer = new StringBuilder();
    // Walk the string by code point so that surrogate pairs are not split.
    for (int i = 0; i < str.length(); ) {
        int codePoint = str.codePointAt(i);
        i += Character.charCount(codePoint);
        int category = Character.getType(codePoint);
        // A code point that is not a mark (or control/symbol) starts a new character:
        // flush whatever has been collected so far.
        if (charBuffer.length() > 0
                && category != Character.NON_SPACING_MARK
                && category != Character.COMBINING_SPACING_MARK
                && category != Character.CONTROL
                && category != Character.OTHER_SYMBOL) {
            characters.add(charBuffer.toString());
            charBuffer.delete(0, charBuffer.length());
        }
        charBuffer.appendCodePoint(codePoint);
    }
    // Flush the last collected character.
    if (charBuffer.length() > 0) {
        characters.add(charBuffer.toString());
    }
    System.out.println(characters);

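The result I get is [கு, மா, ர்]. If it does not work for all of your strings, try other Unicode character categories in the if block.

This turned out to be really ugly... I debugged your string, and it contains the following characters (with their hexadecimal positions):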

க 0x0b95
ு 0x0bc1
ம 0x0bae
ா 0x0bbe
ர 0x0bb0
் 0x0bcd

So the Tamil language obviously uses diacritic-like sequences to form all of its characters, and these unfortunately count as separate entities.
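A dump like the one above could be produced with a few lines of Java. This is only a sketch: the class name CodePointDump and the helpers isMark and countMarks are hypothetical, and it relies on Character.getType and Character.getName from the JDK.

```java
// Hypothetical demo class for inspecting the code points of a string.
class CodePointDump {
    // True for the three Unicode mark categories (Mn, Mc, Me).
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Number of code points in s that are combining marks.
    static long countMarks(String s) {
        return s.codePoints().filter(CodePointDump::isMark).count();
    }

    public static void main(String[] args) {
        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"; // "குமார்"
        // One line per code point: hex value, Unicode name, and whether it is a mark.
        tamil.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s mark=%b%n", cp, Character.getName(cp), isMark(cp)));
        System.out.println(countMarks(tamil)); // 3
    }
}
```

Three of the six code points are reported as marks, which is why length() and the perceived character count differ by exactly three.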

This is not a problem with UTF-8 / UTF-16 as erroneously claimed by other answers; it is inherent in the Unicode encoding of the Tamil language.

The suggested Normalizer does not work; it seems that Tamil has been designed by Unicode "experts" to explicitly use combining sequences which cannot be normalized. Aargh.

My next idea is not to count characters, but glyphs, the visual representations of characters.

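import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import java.awt.geom.AffineTransform;
import java.text.Normalizer;

String str1 = Normalizer.normalize("குமார்", Normalizer.Form.NFC);

Font display = new Font("SansSerif", Font.PLAIN, 12);
GlyphVector vec = display.createGlyphVector(new FontRenderContext(new AffineTransform(), false, false), str1);

System.out.println(vec.getNumGlyphs());
// Assumes one glyph per char, as in the result below; other fonts may produce fewer glyphs.
for (int i = 0; i < str1.length(); i++)
    System.out.printf("%s %s %s %n", str1.charAt(i), Integer.toHexString((int) str1.charAt(i)), vec.getGlyphVisualBounds(i).getBounds2D().toString());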

The result:

க b95 [x=0.0,y=-6.0,w=7.0,h=6.0]
ு bc1 [x=8.0,y=-6.0,w=7.0,h=4.0]
ம bae [x=17.0,y=-6.0,w=6.0,h=6.0]
ா bbe [x=23.0,y=-6.0,w=5.0,h=6.0]
ர bb0 [x=30.0,y=-6.0,w=4.0,h=8.0]
் bcd [x=31.0,y=-9.0,w=1.0,h=2.0]

As the glyphs are intersecting, you need to use Java character type functions like in the other solution.

SOLUTION:

I am using this link: http://www.venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf

public static int getTamilStringLength(String tamil) {
    int dependentCharacterLength = 0;
    for (int index = 0; index < tamil.length(); index++) {
        char code = tamil.charAt(index);
        // Count the dependent (combining) Tamil signs, which attach to a base letter.
        if (code == 0x0B82)
            dependentCharacterLength++;
        else if (code >= 0x0BBE && code <= 0x0BC8)
            dependentCharacterLength++;
        else if (code >= 0x0BCA && code <= 0x0BD7)
            dependentCharacterLength++;
    }
    // Visible letters = all chars minus the dependent ones.
    return tamil.length() - dependentCharacterLength;
}

You need to exclude the combining characters and count only the remaining base characters.
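A script-independent variant of the same idea, sketched here as an assumption rather than as the method from the linked PDF, is to skip every code point whose Unicode category is a mark instead of hard-coding the Tamil ranges. The class name LetterCounter and the method countBaseCharacters are hypothetical.

```java
// Hypothetical helper: counts code points that are not combining marks.
class LetterCounter {
    static int countBaseCharacters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int type = Character.getType(cp);
            boolean mark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!mark) {
                count++; // a base character starts a new visible unit
            }
            i += Character.charCount(cp); // step over surrogate pairs as one code point
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBaseCharacters("\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD")); // "குமார்" -> 3
        System.out.println(countBaseCharacters("A\u0301"));                               // 1
    }
}
```

Like the range-based version, this counts a conjunct such as க்ஷ as two characters, because it contains two base letters joined by a virama.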

As has been mentioned, your string contains 6 distinct code points. Half of them are letters; the other half are vowel signs (combining marks).

You could use the transformations built into the ICU4J library to remove everything that is not a Letter, which includes the vowel signs, using the rule:

[:^Letter:] Remove

and count the resulting string. Try it out on their demo site:

http://demo.icu-project.org/icu-bin/translit

I wouldn't display the resultant string to an end user, and I'm not an expert so the rules may need to be tweaked to get to the general case but it's a thought.
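For readers without ICU4J on the classpath, a rough standard-library stand-in for the rule above is a regex that deletes everything outside the Unicode Letter category. This is not the ICU transliterator itself, only an approximation of its effect; the class name StripNonLetters and its methods are hypothetical.

```java
// Hypothetical stand-in for the ICU rule "[:^Letter:] Remove", using only java.lang.String.
class StripNonLetters {
    // Removes every code point that is not in the Unicode Letter category (\P{L}).
    static String lettersOnly(String s) {
        return s.replaceAll("\\P{L}+", "");
    }

    // Counts the letters that remain, in code points.
    static int letterCount(String s) {
        String letters = lettersOnly(s);
        return letters.codePointCount(0, letters.length());
    }

    public static void main(String[] args) {
        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"; // "குமார்"
        System.out.println(letterCount(tamil)); // 3
    }
}
```

As with the ICU rule, digits, spaces and punctuation are removed too, so the result is a letter count rather than a general character count.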

codePointCount returns the length of a Java String in Unicode code points rather than in UTF-16 char units. Note that each combining mark is its own code point, so this still reports 6 for "குமார்".

int unicodeLength = str.codePointCount(0, str.length());
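To see what codePointCount does and does not account for, here is a small sketch comparing it with length(). The class name CodePointCountDemo and the helper codePoints are hypothetical; the emoji is written with escapes so the source file encoding does not matter.

```java
// Hypothetical demo comparing UTF-16 length with code point count.
class CodePointCountDemo {
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00";                          // U+1F600, stored as a surrogate pair
        String tamil = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";  // "குமார்"

        System.out.println(emoji.length());     // 2 UTF-16 chars
        System.out.println(codePoints(emoji));  // 1 code point

        System.out.println(tamil.length());     // 6 UTF-16 chars
        System.out.println(codePoints(tamil));  // 6 code points: marks are counted separately
    }
}
```

So codePointCount fixes the surrogate-pair case but does not group combining marks with their base letters.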

ReferenceURL : https://stackoverflow.com/questions/15947992/java-unicode-string-length
